1. Names and Honor Code

Derek Christensen

Honor code (K-State Honor Pledge): "On my honor, as a student, I have neither given nor received unauthorized aid on this academic work."

Import Needed libraries

In [250]:
# Import Needed libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from patsy import dmatrices

#import decisiontreeclassifier
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
#import randomforest classifier
from sklearn.ensemble import RandomForestClassifier
#import logisticregression classifier
from sklearn.linear_model import LogisticRegression
import statsmodels.api as sm
#import knn classifier
from sklearn.neighbors import KNeighborsClassifier

#for validating your classification model
#(sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it)
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn import metrics
from sklearn.metrics import roc_curve, auc

# feature selection
from sklearn.feature_selection import RFE
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.model_selection import GridSearchCV

from IPython.display import Image
from IPython.core.display import HTML

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

In [377]:
# input title slide here
Image('Images/christensen_finalprj/Slide1.png')
Out[377]:
In [378]:
# input outline slide here
Image('Images/christensen_finalprj/Slide2.png')
Out[378]:
In [379]:
# input problem description & introduction slide here
Image('Images/christensen_finalprj/Slide3.png')
Out[379]:

2. Business Understanding & Data Understanding

Determine Business Objectives:

  • Background: Hospitals are penalized for patients who are re-admitted less than 30 days after they are released.
  • Business Objectives: Reduce the number of patients re-admitted less than 30 days after they are released.
  • Success Criteria: Identification of factors that increase the likelihood of a patient returning within 30 days.
  • Business Value: The average cost in 2011 for a hospital stay was $10,000.*
  • *http://www.beckershospitalreview.com/finance/11-statistics-on-average-hospital-costs-per-stay.html
In [380]:
# embed full problem description web image
Image(url= "https://72022e22-a-62cb3a1a-s-sites.googlegroups.com/site/christensenfinalprj/full-problem-description/christensen_finalprj-full-prob-descrp.png?attachauth=ANoY7coojyQ211rGpA2ygjc39hleeQ1Z-4R5R-B88BLBFtH1jAq2nT7l4JsRJfQhXwLYsBJ-HR_9ZFkGPGhybCqgeeBS4QVGujnYQae0dHKQCi6coswid7_6nzzAJ-riEv5DfvoEe1sls3dzBnvabtMJMesqIkRfqQSISQBI-Bdpp1ZveQ__SDPGMfWoW4kK1rmIropOOyy_2QQXBqRlq1hCp7cN6UPYzTlp54LBVweAnkfNexPQsIZDh90sH3xMlJGsDWOoDZHAHlB16u69fTPBSXvO57EqRNRejBKcU2bYYujzMRFjMp0%3D&attredirects=0")
Out[380]:
In [381]:
# input key findings & insights slide here
Image('Images/christensen_finalprj/Slide4.png')
Out[381]:
In [382]:
# input key final analysis & recommendation slide here
Image('Images/christensen_finalprj/Slide5.png')
Out[382]:
In [383]:
# input next steps slide here
Image('Images/christensen_finalprj/Slide6.png')
Out[383]:

Next Steps Ideas

  • Analyze patients re-admitted just past the 30 day threshold - i.e. 31 to 45 or 60 days
  • Weight the data
  • Cross-reference the 3 diagnoses
  • Analyze the order of the 3 diagnoses
  • Add more diagnoses
  • Use more granular diagnosis categories
  • ?
In [384]:
# input dataset slide here
Image('Images/christensen_finalprj/Slide7.png')
Out[384]:

3. Data Identification & Collection

Describe Data:

  • Description: The dataset contains over 56,000 HIPAA-compliant, de-identified records of hospital admissions.
  • Source: Hack K-State 2016 : Data Science For Social Good - https://zslie.github.io/
  • Details: There are 50 columns: the Visit ID, the Patient ID, and 48 factors.
  • Factors: The factors have between 1 and 715 distinct values each, so there are ~5.27x10^41 possible combinations.
  • Factors: Descriptions below.
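One way a combination count like the one above could be obtained is by multiplying the number of distinct values of every factor. The sketch below shows the idea on a small made-up frame; `toy` and its three columns are illustrative assumptions, not the real patient data, and it is not confirmed that the ~5.27x10^41 figure was computed this way.

```python
import math
import pandas as pd

# Toy stand-in for the patient data (hypothetical values); the real frame has 48 factors.
toy = pd.DataFrame({
    "gender": ["Female", "Male", "Female", "Male"],
    "age": ["[70-80)", "[80-90)", "[60-70)", "[70-80)"],
    "insulin": ["No", "Steady", "Up", "Down"],
})

# Distinct values per factor.
levels = toy.nunique()

# Product of the per-factor counts = number of possible value combinations.
n_combinations = math.prod(int(n) for n in levels)

print(levels.to_dict())
print(n_combinations)
```

On the real data the same two lines would run on the factor columns of `df` after the identifier columns are excluded.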
In [259]:
#embed factor descriptions 'fd'
fd = pd.read_excel('data/factor-definitions.xlsx')
fd
Out[259]:
Column Name Column Value Type Description and values
0 encounter_id Encounter ID Numeric Unique identifier of an encounter
1 patient_nbr Patient number Numeric Unique identifier of a patient
2 race Race Nominal Values: Caucasian, Asian, African American, Hi...
3 gender Gender Nominal Values: male, female, and unknown/invalid
4 age Age Nominal Grouped in 10-year intervals: [0, 10), [10, 20...
5 weight Weight Numeric Weight in pounds.
6 admission_type_id Admission Type Nominal Integer identifier corresponding to 9 distinct...
7 discharge_disposition_id Discharge Disposition Nominal Integer identifier corresponding to 29 distinc...
8 admission_source_id Admission Source Nominal Integer identifier corresponding to 21 distinc...
9 time_in_hospital Time in Hospital Numeric Integer number of days between admission and d...
10 payer_code Payer Code Nominal Integer identifier corresponding to 23 distinc...
11 medical_specialty Medical Specialty Nominal Integer identifier of a specialty of the admit...
12 num_lab_procedures Number of lab procedures Numeric Number of lab tests performed during the encou...
13 num_procedures Number of procedures Numeric Number of procedures (other than lab tests) pe...
14 num_medications Number of medications Numeric Number of distinct generic names administered ...
15 number_outpatient Number of outpatient visits Numeric Number of outpatient visits of the patient in ...
16 number_emergency Number of emergency visits Numeric Number of emergency visits of the patient in t...
17 number_inpatient Number of inpatient visits Numeric Number of inpatient visits of the patient in t...
18 diag_1 Diagnosis 1 Nominal The primary diagnosis (coded as first three di...
19 diag_2 Diagnosis 2 Nominal Secondary diagnosis (coded as first three digi...
20 diag_3 Diagnosis 3 Nominal Additional secondary diagnosis (coded as first...
21 number_diagnoses Number of diagnoses Numeric Number of diagnoses entered to the system
22 max_glu_serum Glucose serum test result Nominal Indicates the range of the result or if the te...
23 A1Cresult A1c test result Nominal Indicates the range of the result or if the te...
24 24 features for medications 24 features for medications Nominal For the generic names: metformin, repaglinide,...
25 change Change of medications Nominal Indicates if there was a change in diabetic me...
26 diabetesMed Diabetes medications Nominal Indicates if there was any diabetic medication...
27 readmitted Readmitted Nominal Days to inpatient readmission. Values: “<30” i...

Some data manipulation was performed directly in Excel, including:

  • Encoded 'medical_specialty' as the numeric column 'MED_SPEC_NUM'
  • Mapped the 3 mixed string/integer 'diag_x' columns to 'DIAG_CAT_X' columns, converting 858 unique diagnoses into 33 diagnosis categories
  • Notes are in Challenge_1_Training_Data_Conversion.xlsx file on the "Storage" page
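The diagnosis conversion was done in Excel, and its exact rules live in the conversion workbook. As a rough illustration only, a similar mapping could be sketched in pandas. The category numbers below are the ones visible in the sample rows of the data; the range boundaries are standard ICD-9 chapter ranges and are an assumption about how the categories were drawn. `CHAPTER_RANGES`, `UNMAPPED` and `diag_to_category` are hypothetical names.

```python
import pandas as pd

# (low, high, category): ICD-9 chapter ranges paired with category numbers seen
# in the sample rows. The boundaries are assumed, not taken from the Excel rules.
CHAPTER_RANGES = [
    (140, 240, 2),    # neoplasms
    (280, 290, 4),    # blood and blood-forming organs
    (460, 520, 16),   # respiratory system
    (520, 580, 17),   # digestive system
    (580, 630, 18),   # genitourinary system
]
UNMAPPED = -1  # hypothetical sentinel for codes this sketch does not cover


def diag_to_category(code: str) -> int:
    """Return an illustrative category number for a raw ICD-9 code string."""
    try:
        value = float(code)
    except ValueError:
        # '?', V codes and E codes are not numeric and are left unmapped here.
        return UNMAPPED
    for low, high, category in CHAPTER_RANGES:
        if low <= value < high:
            return category
    return UNMAPPED


sample = pd.Series(["518", "289", "593", "191", "537", "V42", "?"])
categories = sample.map(diag_to_category)
print(categories.tolist())
```

Keeping the rules in code rather than in a spreadsheet would make the conversion repeatable on new data, at the cost of having to transcribe all 33 categories.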
In [260]:
#import patient data
df = pd.read_csv('data/Challenge_1_Training_Work_Clean.csv')
df.head(5)
Out[260]:
encounter_id patient_nbr race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code medical_specialty MED_SPEC_NUM num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient diag_1 DIAG_CAT_1 diag_2 DIAG_CAT_2 diag_3 DIAG_CAT_3 number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide examide citoglipton insulin glyburide.metformin glipizide.metformin glimepiride.pioglitazone metformin.rosiglitazone metformin.pioglitazone change diabetesMed readmitted
0 22915332 1475073 Caucasian Female [80-90) ? 3 1 4 5 ? ? 0 39 3 11 0 0 0 414 10 289 4 593 18 7 None None No No No No No No No Steady No No No No No No No No No No No No No No No No Yes >30
1 158361324 93771396 Caucasian Female [70-80) ? 5 3 1 6 MC ? 0 79 1 25 3 0 0 518 16 428 13 496 16 9 None None Steady No No No No No No Steady No No No No No No No No No Up No No No No No Ch Yes NO
2 120453192 24581277 Other Female [60-70) ? 1 22 7 4 SP InternalMedicine 18 29 2 18 0 0 1 820 24 599 18 191 2 9 None None No No No No No No No No No No No No No No No No No Steady No No No No No No Yes NO
3 25590894 5041395 Caucasian Male [70-80) ? 1 1 7 3 ? InternalMedicine 18 72 3 18 0 0 0 537 17 280 4 250.41 3 9 None None No No No No No No No No No No No No No No No No No Steady No No No No No No Yes >30
4 154290822 49027563 Caucasian Female [30-40) ? 2 1 1 3 ? ? 0 21 1 6 0 0 0 790 23 599 18 V42 32 9 None None No No No No No No No No No No No No No No No No No No No No No No No No No NO
In [261]:
# Number of unique values in each column
df.apply(pd.Series.nunique)
Out[261]:
encounter_id                56000
patient_nbr                 44369
race                            6
gender                          2
age                            10
weight                         10
admission_type_id               8
discharge_disposition_id       26
admission_source_id            17
time_in_hospital               14
payer_code                     17
medical_specialty              64
MED_SPEC_NUM                   64
num_lab_procedures            114
num_procedures                  7
num_medications                73
number_outpatient              32
number_emergency               25
number_inpatient               21
diag_1                        661
DIAG_CAT_1                     31
diag_2                        668
DIAG_CAT_2                     30
diag_3                        715
DIAG_CAT_3                     30
number_diagnoses               16
max_glu_serum                   4
A1Cresult                       4
metformin                       4
repaglinide                     4
nateglinide                     4
chlorpropamide                  4
glimepiride                     4
acetohexamide                   2
glipizide                       4
glyburide                       4
tolbutamide                     2
pioglitazone                    4
rosiglitazone                   4
acarbose                        4
miglitol                        4
troglitazone                    2
tolazamide                      2
examide                         1
citoglipton                     1
insulin                         4
glyburide.metformin             4
glipizide.metformin             2
glimepiride.pioglitazone        1
metformin.rosiglitazone         2
metformin.pioglitazone          2
change                          2
diabetesMed                     2
readmitted                      3
dtype: int64
In [262]:
#show the information about the data
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 56000 entries, 0 to 55999
Data columns (total 54 columns):
encounter_id                56000 non-null int64
patient_nbr                 56000 non-null int64
race                        56000 non-null object
gender                      56000 non-null object
age                         56000 non-null object
weight                      56000 non-null object
admission_type_id           56000 non-null int64
discharge_disposition_id    56000 non-null int64
admission_source_id         56000 non-null int64
time_in_hospital            56000 non-null int64
payer_code                  56000 non-null object
medical_specialty           56000 non-null object
MED_SPEC_NUM                56000 non-null int64
num_lab_procedures          56000 non-null int64
num_procedures              56000 non-null int64
num_medications             56000 non-null int64
number_outpatient           56000 non-null int64
number_emergency            56000 non-null int64
number_inpatient            56000 non-null int64
diag_1                      56000 non-null object
DIAG_CAT_1                  56000 non-null int64
diag_2                      56000 non-null object
DIAG_CAT_2                  56000 non-null int64
diag_3                      56000 non-null object
DIAG_CAT_3                  56000 non-null int64
number_diagnoses            56000 non-null int64
max_glu_serum               56000 non-null object
A1Cresult                   56000 non-null object
metformin                   56000 non-null object
repaglinide                 56000 non-null object
nateglinide                 56000 non-null object
chlorpropamide              56000 non-null object
glimepiride                 56000 non-null object
acetohexamide               56000 non-null object
glipizide                   56000 non-null object
glyburide                   56000 non-null object
tolbutamide                 56000 non-null object
pioglitazone                56000 non-null object
rosiglitazone               56000 non-null object
acarbose                    56000 non-null object
miglitol                    56000 non-null object
troglitazone                56000 non-null object
tolazamide                  56000 non-null object
examide                     56000 non-null object
citoglipton                 56000 non-null object
insulin                     56000 non-null object
glyburide.metformin         56000 non-null object
glipizide.metformin         56000 non-null object
glimepiride.pioglitazone    56000 non-null object
metformin.rosiglitazone     56000 non-null object
metformin.pioglitazone      56000 non-null object
change                      56000 non-null object
diabetesMed                 56000 non-null object
readmitted                  56000 non-null object
dtypes: int64(17), object(37)
memory usage: 23.1+ MB
In [263]:
#describe the column readmitted only (e.g., count, unique, frequency)
df['readmitted'].describe()
Out[263]:
count     56000
unique        3
top          NO
freq      30238
Name: readmitted, dtype: object
In [264]:
#distribution of the three classes (<30, >30, NO) in the readmitted column
df.groupby('readmitted').count()
Out[264]:
encounter_id patient_nbr race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code medical_specialty MED_SPEC_NUM num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient diag_1 DIAG_CAT_1 diag_2 DIAG_CAT_2 diag_3 DIAG_CAT_3 number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide examide citoglipton insulin glyburide.metformin glipizide.metformin glimepiride.pioglitazone metformin.rosiglitazone metformin.pioglitazone change diabetesMed
readmitted
<30 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285
>30 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477
NO 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238
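Because no column has missing values, `groupby(...).count()` repeats the same number in every column. A more compact view of the same distribution is `Series.value_counts()`. A minimal sketch on a made-up series standing in for `df['readmitted']`:

```python
import pandas as pd

# Toy labels (hypothetical) standing in for df['readmitted'].
readmitted = pd.Series(["NO", ">30", "NO", "<30", ">30", "NO"], name="readmitted")

# One count per class instead of one count per class and column.
counts = readmitted.value_counts()
print(counts)
```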
In [265]:
#replace the values of the 'readmitted' column:
# NO = 0
# >30 = 1
# <30 = 2

df = df.replace({'readmitted': {'NO': 0, '>30': 1, '<30': 2}})

df.head(2)
Out[265]:
encounter_id patient_nbr race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code medical_specialty MED_SPEC_NUM num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient diag_1 DIAG_CAT_1 diag_2 DIAG_CAT_2 diag_3 DIAG_CAT_3 number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide examide citoglipton insulin glyburide.metformin glipizide.metformin glimepiride.pioglitazone metformin.rosiglitazone metformin.pioglitazone change diabetesMed readmitted
0 22915332 1475073 Caucasian Female [80-90) ? 3 1 4 5 ? ? 0 39 3 11 0 0 0 414 10 289 4 593 18 7 None None No No No No No No No Steady No No No No No No No No No No No No No No No No Yes 1
1 158361324 93771396 Caucasian Female [70-80) ? 5 3 1 6 MC ? 0 79 1 25 3 0 0 518 16 428 13 496 16 9 None None Steady No No No No No No Steady No No No No No No No No No Up No No No No No Ch Yes 0

Key Business Data Question Summary

  • Of the 56,000 hospital visits in this dataset:
    • 6,285 were re-admitted in < 30 days - these are the cases that need to be solved for
    • 19,477 were also re-admitted, but after the 30 day threshold
    • 30,238 were not re-admitted - some insight could also be gleaned from why they DIDN'T have to be re-admitted
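The three counts can be turned into class shares, which shows how imbalanced the penalized class is. The sketch below uses only the counts from the summary; the "baseline" is a hypothetical trivial model that never predicts a re-admission within 30 days, added here for comparison.

```python
# Class counts from the readmitted distribution.
counts = {"<30": 6285, ">30": 19477, "NO": 30238}
total = sum(counts.values())

# Share of each class, in percent.
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

# Accuracy (in percent) of a trivial model that never predicts '<30'.
baseline_accuracy = round(100 * (total - counts["<30"]) / total, 1)

print(total)
print(shares)
print(baseline_accuracy)
```

Only about 11% of visits fall in the < 30 day class, so a model has to beat roughly 89% accuracy before accuracy alone says anything; recall on the < 30 class is likely the more useful measure.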
In [385]:
# input ETL slide here
Image('Images/christensen_finalprj/Slide8.png')
Out[385]:

Data understanding & processing (ETL)

In [267]:
#drop the columns 'encounter_id' and 'patient_nbr' (identifiers not used in the analysis)
#and 'medical_specialty' (already encoded as 'MED_SPEC_NUM'), then display the result
df = df.drop(['encounter_id', 'patient_nbr', 'medical_specialty'], axis=1)

# drop the columns 'diag_1', 'diag_2' and 'diag_3' since these values have been put into categories
# in columns 'DIAG_CAT_1', 'DIAG_CAT_2' and 'DIAG_CAT_3'
df = df.drop(['diag_1', 'diag_2', 'diag_3'], axis=1)

df.head(5)
Out[267]:
race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code MED_SPEC_NUM num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient DIAG_CAT_1 DIAG_CAT_2 DIAG_CAT_3 number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide examide citoglipton insulin glyburide.metformin glipizide.metformin glimepiride.pioglitazone metformin.rosiglitazone metformin.pioglitazone change diabetesMed readmitted
0 Caucasian Female [80-90) ? 3 1 4 5 ? 0 39 3 11 0 0 0 10 4 18 7 None None No No No No No No No Steady No No No No No No No No No No No No No No No No Yes 1
1 Caucasian Female [70-80) ? 5 3 1 6 MC 0 79 1 25 3 0 0 16 13 16 9 None None Steady No No No No No No Steady No No No No No No No No No Up No No No No No Ch Yes 0
2 Other Female [60-70) ? 1 22 7 4 SP 18 29 2 18 0 0 1 24 18 2 9 None None No No No No No No No No No No No No No No No No No Steady No No No No No No Yes 0
3 Caucasian Male [70-80) ? 1 1 7 3 ? 18 72 3 18 0 0 0 17 4 3 9 None None No No No No No No No No No No No No No No No No No Steady No No No No No No Yes 1
4 Caucasian Female [30-40) ? 2 1 1 3 ? 0 21 1 6 0 0 0 23 18 32 9 None None No No No No No No No No No No No No No No No No No No No No No No No No No 0
In [268]:
#distribution of races in the race column
df.groupby('race').count()
Out[268]:
gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code MED_SPEC_NUM num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient DIAG_CAT_1 DIAG_CAT_2 DIAG_CAT_3 number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide examide citoglipton insulin glyburide.metformin glipizide.metformin glimepiride.pioglitazone metformin.rosiglitazone metformin.pioglitazone change diabetesMed readmitted
race
? 1215 1215 1215 1215 1215 1215 1215 1215 1215 1215 1215 1215 1215 1215 1215 1215 1215 1215 1215 1215 1215 1215 1215 1215 1215 1215 1215 1215 1215 1215 1215 1215 1215 1215 1215 1215 1215 1215 1215 1215 1215 1215 1215 1215 1215 1215 1215
AfricanAmerican 10563 10563 10563 10563 10563 10563 10563 10563 10563 10563 10563 10563 10563 10563 10563 10563 10563 10563 10563 10563 10563 10563 10563 10563 10563 10563 10563 10563 10563 10563 10563 10563 10563 10563 10563 10563 10563 10563 10563 10563 10563 10563 10563 10563 10563 10563 10563
Asian 356 356 356 356 356 356 356 356 356 356 356 356 356 356 356 356 356 356 356 356 356 356 356 356 356 356 356 356 356 356 356 356 356 356 356 356 356 356 356 356 356 356 356 356 356 356 356
Caucasian 41886 41886 41886 41886 41886 41886 41886 41886 41886 41886 41886 41886 41886 41886 41886 41886 41886 41886 41886 41886 41886 41886 41886 41886 41886 41886 41886 41886 41886 41886 41886 41886 41886 41886 41886 41886 41886 41886 41886 41886 41886 41886 41886 41886 41886 41886 41886
Hispanic 1117 1117 1117 1117 1117 1117 1117 1117 1117 1117 1117 1117 1117 1117 1117 1117 1117 1117 1117 1117 1117 1117 1117 1117 1117 1117 1117 1117 1117 1117 1117 1117 1117 1117 1117 1117 1117 1117 1117 1117 1117 1117 1117 1117 1117 1117 1117
Other 863 863 863 863 863 863 863 863 863 863 863 863 863 863 863 863 863 863 863 863 863 863 863 863 863 863 863 863 863 863 863 863 863 863 863 863 863 863 863 863 863 863 863 863 863 863 863
In [269]:
#replace the values of the 'race' column:
# ? = 0
# AfricanAmerican = 1
# Asian = 2
# Caucasian = 3
# Hispanic = 4
# Other = 5

df = df.replace({'race': {'?': 0, 'AfricanAmerican': 1, 'Asian': 2,'Caucasian': 3,'Hispanic': 4,'Other': 5}})

df.head(2)
Out[269]:
race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code MED_SPEC_NUM num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient DIAG_CAT_1 DIAG_CAT_2 DIAG_CAT_3 number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide examide citoglipton insulin glyburide.metformin glipizide.metformin glimepiride.pioglitazone metformin.rosiglitazone metformin.pioglitazone change diabetesMed readmitted
0 3 Female [80-90) ? 3 1 4 5 ? 0 39 3 11 0 0 0 10 4 18 7 None None No No No No No No No Steady No No No No No No No No No No No No No No No No Yes 1
1 3 Female [70-80) ? 5 3 1 6 MC 0 79 1 25 3 0 0 16 13 16 9 None None Steady No No No No No No Steady No No No No No No No No No Up No No No No No Ch Yes 0
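Race is a nominal factor, so the integer codes 0 to 5 carry no real ordering, yet distance-based or linear models may read them as one. A possible alternative, not what this notebook does, is one-hot encoding with `pd.get_dummies`. A minimal sketch on a made-up series:

```python
import pandas as pd

# Toy values (hypothetical) standing in for the original string-valued race column.
race = pd.Series(["Caucasian", "AfricanAmerican", "?", "Asian"], name="race")

# One indicator column per race value; each row has exactly one indicator set.
race_dummies = pd.get_dummies(race, prefix="race")
print(race_dummies.columns.tolist())
```

Tree-based models are less sensitive to this choice, so the integer coding may be adequate for the decision tree and random forest steps.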
In [270]:
#distribution of genders in the gender column
df.groupby('gender').count()
Out[270]:
race age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code MED_SPEC_NUM num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient DIAG_CAT_1 DIAG_CAT_2 DIAG_CAT_3 number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide examide citoglipton insulin glyburide.metformin glipizide.metformin glimepiride.pioglitazone metformin.rosiglitazone metformin.pioglitazone change diabetesMed readmitted
gender
Female 29990 29990 29990 29990 29990 29990 29990 29990 29990 29990 29990 29990 29990 29990 29990 29990 29990 29990 29990 29990 29990 29990 29990 29990 29990 29990 29990 29990 29990 29990 29990 29990 29990 29990 29990 29990 29990 29990 29990 29990 29990 29990 29990 29990 29990 29990 29990
Male 26010 26010 26010 26010 26010 26010 26010 26010 26010 26010 26010 26010 26010 26010 26010 26010 26010 26010 26010 26010 26010 26010 26010 26010 26010 26010 26010 26010 26010 26010 26010 26010 26010 26010 26010 26010 26010 26010 26010 26010 26010 26010 26010 26010 26010 26010 26010
In [271]:
#replace the values of the 'gender' column:
# Female = 0
# Male = 1

df = df.replace({'gender': {'Male': 1, 'Female': 0}})

df.head(2)
Out[271]:
race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code MED_SPEC_NUM num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient DIAG_CAT_1 DIAG_CAT_2 DIAG_CAT_3 number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide examide citoglipton insulin glyburide.metformin glipizide.metformin glimepiride.pioglitazone metformin.rosiglitazone metformin.pioglitazone change diabetesMed readmitted
0 3 0 [80-90) ? 3 1 4 5 ? 0 39 3 11 0 0 0 10 4 18 7 None None No No No No No No No Steady No No No No No No No No No No No No No No No No Yes 1
1 3 0 [70-80) ? 5 3 1 6 MC 0 79 1 25 3 0 0 16 13 16 9 None None Steady No No No No No No Steady No No No No No No No No No Up No No No No No Ch Yes 0
In [272]:
#distribution of decade age categories in the age column
df.groupby('age').count()
Out[272]:
race gender weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code MED_SPEC_NUM num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient DIAG_CAT_1 DIAG_CAT_2 DIAG_CAT_3 number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide examide citoglipton insulin glyburide.metformin glipizide.metformin glimepiride.pioglitazone metformin.rosiglitazone metformin.pioglitazone change diabetesMed readmitted
age
[0-10) 98 98 98 98 98 98 98 98 98 98 98 98 98 98 98 98 98 98 98 98 98 98 98 98 98 98 98 98 98 98 98 98 98 98 98 98 98 98 98 98 98 98 98 98 98 98 98
[10-20) 355 355 355 355 355 355 355 355 355 355 355 355 355 355 355 355 355 355 355 355 355 355 355 355 355 355 355 355 355 355 355 355 355 355 355 355 355 355 355 355 355 355 355 355 355 355 355
[20-30) 934 934 934 934 934 934 934 934 934 934 934 934 934 934 934 934 934 934 934 934 934 934 934 934 934 934 934 934 934 934 934 934 934 934 934 934 934 934 934 934 934 934 934 934 934 934 934
[30-40) 2070 2070 2070 2070 2070 2070 2070 2070 2070 2070 2070 2070 2070 2070 2070 2070 2070 2070 2070 2070 2070 2070 2070 2070 2070 2070 2070 2070 2070 2070 2070 2070 2070 2070 2070 2070 2070 2070 2070 2070 2070 2070 2070 2070 2070 2070 2070
[40-50) 5237 5237 5237 5237 5237 5237 5237 5237 5237 5237 5237 5237 5237 5237 5237 5237 5237 5237 5237 5237 5237 5237 5237 5237 5237 5237 5237 5237 5237 5237 5237 5237 5237 5237 5237 5237 5237 5237 5237 5237 5237 5237 5237 5237 5237 5237 5237
[50-60) 9578 9578 9578 9578 9578 9578 9578 9578 9578 9578 9578 9578 9578 9578 9578 9578 9578 9578 9578 9578 9578 9578 9578 9578 9578 9578 9578 9578 9578 9578 9578 9578 9578 9578 9578 9578 9578 9578 9578 9578 9578 9578 9578 9578 9578 9578 9578
[60-70) 12422 12422 12422 12422 12422 12422 12422 12422 12422 12422 12422 12422 12422 12422 12422 12422 12422 12422 12422 12422 12422 12422 12422 12422 12422 12422 12422 12422 12422 12422 12422 12422 12422 12422 12422 12422 12422 12422 12422 12422 12422 12422 12422 12422 12422 12422 12422
[70-80) 14356 14356 14356 14356 14356 14356 14356 14356 14356 14356 14356 14356 14356 14356 14356 14356 14356 14356 14356 14356 14356 14356 14356 14356 14356 14356 14356 14356 14356 14356 14356 14356 14356 14356 14356 14356 14356 14356 14356 14356 14356 14356 14356 14356 14356 14356 14356
[80-90) 9436 9436 9436 9436 9436 9436 9436 9436 9436 9436 9436 9436 9436 9436 9436 9436 9436 9436 9436 9436 9436 9436 9436 9436 9436 9436 9436 9436 9436 9436 9436 9436 9436 9436 9436 9436 9436 9436 9436 9436 9436 9436 9436 9436 9436 9436 9436
[90-100) 1514 1514 1514 1514 1514 1514 1514 1514 1514 1514 1514 1514 1514 1514 1514 1514 1514 1514 1514 1514 1514 1514 1514 1514 1514 1514 1514 1514 1514 1514 1514 1514 1514 1514 1514 1514 1514 1514 1514 1514 1514 1514 1514 1514 1514 1514 1514
In [273]:
#replace the values of the 'age' column:
# [0-10) = 0
# [10-20) = 1
# [20-30) = 2
# [30-40) = 3
# [40-50) = 4
# [50-60) = 5
# [60-70) = 6
# [70-80) = 7
# [80-90) = 8
# [90-100) = 9

df = df.replace({'age': {'[0-10)': 0, '[10-20)': 1, '[20-30)': 2, '[30-40)': 3, '[40-50)': 4, '[50-60)': 5, '[60-70)': 6, '[70-80)': 7, '[80-90)': 8, '[90-100)': 9}})

df.head(2)
Out[273]:
race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code MED_SPEC_NUM num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient DIAG_CAT_1 DIAG_CAT_2 DIAG_CAT_3 number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide examide citoglipton insulin glyburide.metformin glipizide.metformin glimepiride.pioglitazone metformin.rosiglitazone metformin.pioglitazone change diabetesMed readmitted
0 3 0 8 ? 3 1 4 5 ? 0 39 3 11 0 0 0 10 4 18 7 None None No No No No No No No Steady No No No No No No No No No No No No No No No No Yes 1
1 3 0 7 ? 5 3 1 6 MC 0 79 1 25 3 0 0 16 13 16 9 None None Steady No No No No No No Steady No No No No No No No No No Up No No No No No Ch Yes 0
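Since every age label is a 10-year interval, the same 0 to 9 codes could also be derived from the label text instead of a hand-written dictionary. A minimal sketch, assuming all labels follow the `[low-high)` pattern seen in the data:

```python
import pandas as pd

# Toy labels (hypothetical selection) standing in for the original age column.
ages = pd.Series(["[80-90)", "[70-80)", "[0-10)", "[90-100)"])

# Take the lower bound of each interval and divide by 10: '[80-90)' -> 8.
age_code = ages.str.extract(r"\[(\d+)-", expand=False).astype(int) // 10
print(age_code.tolist())
```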
In [274]:
#distribution of weight categories in the weight column
df.groupby('weight').count()
Out[274]:
race gender age admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code MED_SPEC_NUM num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient DIAG_CAT_1 DIAG_CAT_2 DIAG_CAT_3 number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide examide citoglipton insulin glyburide.metformin glipizide.metformin glimepiride.pioglitazone metformin.rosiglitazone metformin.pioglitazone change diabetesMed readmitted
weight
>200 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
? 54238 54238 54238 54238 54238 54238 54238 54238 54238 54238 54238 54238 54238 54238 54238 54238 54238 54238 54238 54238 54238 54238 54238 54238 54238 54238 54238 54238 54238 54238 54238 54238 54238 54238 54238 54238 54238 54238 54238 54238 54238 54238 54238 54238 54238 54238 54238
[0-25) 27 27 27 27 27 27 27 27 27 27 27 27 27 27 27 27 27 27 27 27 27 27 27 27 27 27 27 27 27 27 27 27 27 27 27 27 27 27 27 27 27 27 27 27 27 27 27
[100-125) 349 349 349 349 349 349 349 349 349 349 349 349 349 349 349 349 349 349 349 349 349 349 349 349 349 349 349 349 349 349 349 349 349 349 349 349 349 349 349 349 349 349 349 349 349 349 349
[125-150) 67 67 67 67 67 67 67 67 67 67 67 67 67 67 67 67 67 67 67 67 67 67 67 67 67 67 67 67 67 67 67 67 67 67 67 67 67 67 67 67 67 67 67 67 67 67 67
[150-175) 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19
[175-200) 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
[25-50) 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58
[50-75) 488 488 488 488 488 488 488 488 488 488 488 488 488 488 488 488 488 488 488 488 488 488 488 488 488 488 488 488 488 488 488 488 488 488 488 488 488 488 488 488 488 488 488 488 488 488 488
[75-100) 745 745 745 745 745 745 745 745 745 745 745 745 745 745 745 745 745 745 745 745 745 745 745 745 745 745 745 745 745 745 745 745 745 745 745 745 745 745 745 745 745 745 745 745 745 745 745
In [275]:
#replace the values of the 'weight' column:
# ? = 0
# [0-25) = 1
# [25-50) = 2
# [50-75) = 3
# [75-100) = 4
# [100-125) = 5
# [125-150) = 6
# [150-175) = 7
# [175-200) = 8
# >200 = 9

df = df.replace({'weight': {'?': 0, '[0-25)': 1, '[25-50)': 2, '[50-75)': 3, '[75-100)': 4, '[100-125)': 5, '[125-150)': 6, '[150-175)': 7, '[175-200)': 8, '>200': 9}})

df.head(2)
Out[275]:
race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code MED_SPEC_NUM num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient DIAG_CAT_1 DIAG_CAT_2 DIAG_CAT_3 number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide examide citoglipton insulin glyburide.metformin glipizide.metformin glimepiride.pioglitazone metformin.rosiglitazone metformin.pioglitazone change diabetesMed readmitted
0 3 0 8 0 3 1 4 5 ? 0 39 3 11 0 0 0 10 4 18 7 None None No No No No No No No Steady No No No No No No No No No No No No No No No No Yes 1
1 3 0 7 0 5 3 1 6 MC 0 79 1 25 3 0 0 16 13 16 9 None None Steady No No No No No No Steady No No No No No No No No No Up No No No No No Ch Yes 0
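`df.replace` leaves any label that is missing from the dictionary unchanged, so a typo in a key would go unnoticed. A minimal sketch of an alternative, assuming a made-up frame `toy` in place of `df`: `Series.map` turns unmapped labels into NaN, which makes the mapping easy to validate before the column is overwritten.

```python
import pandas as pd

# Same weight mapping as in the cell above.
weight_codes = {'?': 0, '[0-25)': 1, '[25-50)': 2, '[50-75)': 3, '[75-100)': 4,
                '[100-125)': 5, '[125-150)': 6, '[150-175)': 7, '[175-200)': 8,
                '>200': 9}

# Hypothetical toy frame standing in for df.
toy = pd.DataFrame({'weight': ['?', '[75-100)', '>200', '[0-25)']})

# Series.map returns NaN for any label missing from the dictionary,
# so an incomplete mapping is caught here instead of passing through silently.
encoded = toy['weight'].map(weight_codes)
assert encoded.notna().all(), "unmapped weight label found"

toy['weight'] = encoded.astype(int)
print(toy['weight'].tolist())  # [0, 4, 9, 1]
```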
In [276]:
#distribution of payer types in the payer_code column
df.groupby('payer_code').count()
Out[276]:
race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital MED_SPEC_NUM num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient DIAG_CAT_1 DIAG_CAT_2 DIAG_CAT_3 number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide examide citoglipton insulin glyburide.metformin glipizide.metformin glimepiride.pioglitazone metformin.rosiglitazone metformin.pioglitazone change diabetesMed readmitted
payer_code
? 22153 22153 22153 22153 22153 22153 22153 22153 22153 22153 22153 22153 22153 22153 22153 22153 22153 22153 22153 22153 22153 22153 22153 22153 22153 22153 22153 22153 22153 22153 22153 22153 22153 22153 22153 22153 22153 22153 22153 22153 22153 22153 22153 22153 22153 22153 22153
BC 2550 2550 2550 2550 2550 2550 2550 2550 2550 2550 2550 2550 2550 2550 2550 2550 2550 2550 2550 2550 2550 2550 2550 2550 2550 2550 2550 2550 2550 2550 2550 2550 2550 2550 2550 2550 2550 2550 2550 2550 2550 2550 2550 2550 2550 2550 2550
CH 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81
CM 1044 1044 1044 1044 1044 1044 1044 1044 1044 1044 1044 1044 1044 1044 1044 1044 1044 1044 1044 1044 1044 1044 1044 1044 1044 1044 1044 1044 1044 1044 1044 1044 1044 1044 1044 1044 1044 1044 1044 1044 1044 1044 1044 1044 1044 1044 1044
CP 1405 1405 1405 1405 1405 1405 1405 1405 1405 1405 1405 1405 1405 1405 1405 1405 1405 1405 1405 1405 1405 1405 1405 1405 1405 1405 1405 1405 1405 1405 1405 1405 1405 1405 1405 1405 1405 1405 1405 1405 1405 1405 1405 1405 1405 1405 1405
DM 309 309 309 309 309 309 309 309 309 309 309 309 309 309 309 309 309 309 309 309 309 309 309 309 309 309 309 309 309 309 309 309 309 309 309 309 309 309 309 309 309 309 309 309 309 309 309
HM 3506 3506 3506 3506 3506 3506 3506 3506 3506 3506 3506 3506 3506 3506 3506 3506 3506 3506 3506 3506 3506 3506 3506 3506 3506 3506 3506 3506 3506 3506 3506 3506 3506 3506 3506 3506 3506 3506 3506 3506 3506 3506 3506 3506 3506 3506 3506
MC 17855 17855 17855 17855 17855 17855 17855 17855 17855 17855 17855 17855 17855 17855 17855 17855 17855 17855 17855 17855 17855 17855 17855 17855 17855 17855 17855 17855 17855 17855 17855 17855 17855 17855 17855 17855 17855 17855 17855 17855 17855 17855 17855 17855 17855 17855 17855
MD 1977 1977 1977 1977 1977 1977 1977 1977 1977 1977 1977 1977 1977 1977 1977 1977 1977 1977 1977 1977 1977 1977 1977 1977 1977 1977 1977 1977 1977 1977 1977 1977 1977 1977 1977 1977 1977 1977 1977 1977 1977 1977 1977 1977 1977 1977 1977
MP 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46
OG 564 564 564 564 564 564 564 564 564 564 564 564 564 564 564 564 564 564 564 564 564 564 564 564 564 564 564 564 564 564 564 564 564 564 564 564 564 564 564 564 564 564 564 564 564 564 564
OT 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44
PO 312 312 312 312 312 312 312 312 312 312 312 312 312 312 312 312 312 312 312 312 312 312 312 312 312 312 312 312 312 312 312 312 312 312 312 312 312 312 312 312 312 312 312 312 312 312 312
SI 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32
SP 2759 2759 2759 2759 2759 2759 2759 2759 2759 2759 2759 2759 2759 2759 2759 2759 2759 2759 2759 2759 2759 2759 2759 2759 2759 2759 2759 2759 2759 2759 2759 2759 2759 2759 2759 2759 2759 2759 2759 2759 2759 2759 2759 2759 2759 2759 2759
UN 1293 1293 1293 1293 1293 1293 1293 1293 1293 1293 1293 1293 1293 1293 1293 1293 1293 1293 1293 1293 1293 1293 1293 1293 1293 1293 1293 1293 1293 1293 1293 1293 1293 1293 1293 1293 1293 1293 1293 1293 1293 1293 1293 1293 1293 1293 1293
WC 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70
In [277]:
#replace the values of the 'payer_code' column:
# ? = 0, then the remaining codes in alphabetical order (BC = 1 ... WC = 16)

df = df.replace({'payer_code': {'?': 0, 'BC': 1, 'CH': 2, 'CM': 3, 'CP': 4, 'DM': 5, 'HM': 6, 'MC': 7, 'MD': 8, 'MP': 9, 'OG': 10, 'OT': 11, 'PO': 12, 'SI': 13, 'SP': 14, 'UN': 15, 'WC': 16}})

df.head(2)
Out[277]:
race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code MED_SPEC_NUM num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient DIAG_CAT_1 DIAG_CAT_2 DIAG_CAT_3 number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide examide citoglipton insulin glyburide.metformin glipizide.metformin glimepiride.pioglitazone metformin.rosiglitazone metformin.pioglitazone change diabetesMed readmitted
0 3 0 8 0 3 1 4 5 0 0 39 3 11 0 0 0 10 4 18 7 None None No No No No No No No Steady No No No No No No No No No No No No No No No No Yes 1
1 3 0 7 0 5 3 1 6 7 0 79 1 25 3 0 0 16 13 16 9 None None Steady No No No No No No Steady No No No No No No No No No Up No No No No No Ch Yes 0
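The payer_code dictionary above assigns 0 to '?' and then numbers the remaining codes alphabetically. A minimal sketch of building the same dictionary from the data rather than typing it by hand, assuming a made-up series `payer` that holds the 17 labels seen in the groupby output:

```python
import pandas as pd

# Hypothetical series standing in for df['payer_code'] (one row per label).
payer = pd.Series(['MC', '?', 'BC', 'HM', 'SP', 'MD', 'CP', 'UN', 'CM',
                   'OG', 'PO', 'DM', 'CH', 'WC', 'OT', 'MP', 'SI'])

# '?' sorts before the uppercase letters, so it receives code 0.
payer_codes = {label: code
               for code, label in enumerate(sorted(payer.unique()))}
print(payer_codes)

# Apply the mapping; every label is covered, so no NaN is produced.
payer_encoded = payer.map(payer_codes)
print(payer_encoded.tolist())
```

The resulting integers are labels only; their numeric order says nothing about the payer types themselves.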
In [278]:
#distribution of categories in the max_glu_serum column
df.groupby('max_glu_serum').count()
Out[278]:
race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code MED_SPEC_NUM num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient DIAG_CAT_1 DIAG_CAT_2 DIAG_CAT_3 number_diagnoses A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide examide citoglipton insulin glyburide.metformin glipizide.metformin glimepiride.pioglitazone metformin.rosiglitazone metformin.pioglitazone change diabetesMed readmitted
max_glu_serum
>200 800 800 800 800 800 800 800 800 800 800 800 800 800 800 800 800 800 800 800 800 800 800 800 800 800 800 800 800 800 800 800 800 800 800 800 800 800 800 800 800 800 800 800 800 800 800 800
>300 692 692 692 692 692 692 692 692 692 692 692 692 692 692 692 692 692 692 692 692 692 692 692 692 692 692 692 692 692 692 692 692 692 692 692 692 692 692 692 692 692 692 692 692 692 692 692
None 53027 53027 53027 53027 53027 53027 53027 53027 53027 53027 53027 53027 53027 53027 53027 53027 53027 53027 53027 53027 53027 53027 53027 53027 53027 53027 53027 53027 53027 53027 53027 53027 53027 53027 53027 53027 53027 53027 53027 53027 53027 53027 53027 53027 53027 53027 53027
Norm 1481 1481 1481 1481 1481 1481 1481 1481 1481 1481 1481 1481 1481 1481 1481 1481 1481 1481 1481 1481 1481 1481 1481 1481 1481 1481 1481 1481 1481 1481 1481 1481 1481 1481 1481 1481 1481 1481 1481 1481 1481 1481 1481 1481 1481 1481 1481
In [279]:
#replace the values of the 'max_glu_serum' column:
# None = 0
# Norm = 1
# >200 = 2
# >300 = 3

df = df.replace({'max_glu_serum': {'None': 0, 'Norm': 1, '>200': 2, '>300': 3}})

df.head(2)
Out[279]:
race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code MED_SPEC_NUM num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient DIAG_CAT_1 DIAG_CAT_2 DIAG_CAT_3 number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide examide citoglipton insulin glyburide.metformin glipizide.metformin glimepiride.pioglitazone metformin.rosiglitazone metformin.pioglitazone change diabetesMed readmitted
0 3 0 8 0 3 1 4 5 0 0 39 3 11 0 0 0 10 4 18 7 0 None No No No No No No No Steady No No No No No No No No No No No No No No No No Yes 1
1 3 0 7 0 5 3 1 6 7 0 79 1 25 3 0 0 16 13 16 9 0 None Steady No No No No No No Steady No No No No No No No No No Up No No No No No Ch Yes 0
In [280]:
#distribution of categories in the A1Cresult column
df.groupby('A1Cresult').count()
Out[280]:
race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code MED_SPEC_NUM num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient DIAG_CAT_1 DIAG_CAT_2 DIAG_CAT_3 number_diagnoses max_glu_serum metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide examide citoglipton insulin glyburide.metformin glipizide.metformin glimepiride.pioglitazone metformin.rosiglitazone metformin.pioglitazone change diabetesMed readmitted
A1Cresult
>7 2143 2143 2143 2143 2143 2143 2143 2143 2143 2143 2143 2143 2143 2143 2143 2143 2143 2143 2143 2143 2143 2143 2143 2143 2143 2143 2143 2143 2143 2143 2143 2143 2143 2143 2143 2143 2143 2143 2143 2143 2143 2143 2143 2143 2143 2143 2143
>8 4523 4523 4523 4523 4523 4523 4523 4523 4523 4523 4523 4523 4523 4523 4523 4523 4523 4523 4523 4523 4523 4523 4523 4523 4523 4523 4523 4523 4523 4523 4523 4523 4523 4523 4523 4523 4523 4523 4523 4523 4523 4523 4523 4523 4523 4523 4523
None 46560 46560 46560 46560 46560 46560 46560 46560 46560 46560 46560 46560 46560 46560 46560 46560 46560 46560 46560 46560 46560 46560 46560 46560 46560 46560 46560 46560 46560 46560 46560 46560 46560 46560 46560 46560 46560 46560 46560 46560 46560 46560 46560 46560 46560 46560 46560
Norm 2774 2774 2774 2774 2774 2774 2774 2774 2774 2774 2774 2774 2774 2774 2774 2774 2774 2774 2774 2774 2774 2774 2774 2774 2774 2774 2774 2774 2774 2774 2774 2774 2774 2774 2774 2774 2774 2774 2774 2774 2774 2774 2774 2774 2774 2774 2774
In [281]:
#replace the values of the 'A1Cresult' column:
# None = 0
# Norm = 1
# >7 = 2
# >8 = 3

df = df.replace({'A1Cresult': {'None': 0, 'Norm': 1, '>7': 2, '>8': 3}})

df.head(2)
Out[281]:
race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code MED_SPEC_NUM num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient DIAG_CAT_1 DIAG_CAT_2 DIAG_CAT_3 number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide examide citoglipton insulin glyburide.metformin glipizide.metformin glimepiride.pioglitazone metformin.rosiglitazone metformin.pioglitazone change diabetesMed readmitted
0 3 0 8 0 3 1 4 5 0 0 39 3 11 0 0 0 10 4 18 7 0 0 No No No No No No No Steady No No No No No No No No No No No No No No No No Yes 1
1 3 0 7 0 5 3 1 6 7 0 79 1 25 3 0 0 16 13 16 9 0 0 Steady No No No No No No Steady No No No No No No No No No Up No No No No No Ch Yes 0
In [282]:
#distribution of Ch or No in the change column
df.groupby('change').count()
Out[282]:
race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code MED_SPEC_NUM num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient DIAG_CAT_1 DIAG_CAT_2 DIAG_CAT_3 number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide examide citoglipton insulin glyburide.metformin glipizide.metformin glimepiride.pioglitazone metformin.rosiglitazone metformin.pioglitazone diabetesMed readmitted
change
Ch 25910 25910 25910 25910 25910 25910 25910 25910 25910 25910 25910 25910 25910 25910 25910 25910 25910 25910 25910 25910 25910 25910 25910 25910 25910 25910 25910 25910 25910 25910 25910 25910 25910 25910 25910 25910 25910 25910 25910 25910 25910 25910 25910 25910 25910 25910 25910
No 30090 30090 30090 30090 30090 30090 30090 30090 30090 30090 30090 30090 30090 30090 30090 30090 30090 30090 30090 30090 30090 30090 30090 30090 30090 30090 30090 30090 30090 30090 30090 30090 30090 30090 30090 30090 30090 30090 30090 30090 30090 30090 30090 30090 30090 30090 30090
In [283]:
#replace the values of the 'change' column:
# No = 0
# Ch = 1

df = df.replace({'change': {'No': 0, 'Ch': 1}})

df.head(2)
Out[283]:
race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code MED_SPEC_NUM num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient DIAG_CAT_1 DIAG_CAT_2 DIAG_CAT_3 number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide examide citoglipton insulin glyburide.metformin glipizide.metformin glimepiride.pioglitazone metformin.rosiglitazone metformin.pioglitazone change diabetesMed readmitted
0 3 0 8 0 3 1 4 5 0 0 39 3 11 0 0 0 10 4 18 7 0 0 No No No No No No No Steady No No No No No No No No No No No No No No No 0 Yes 1
1 3 0 7 0 5 3 1 6 7 0 79 1 25 3 0 0 16 13 16 9 0 0 Steady No No No No No No Steady No No No No No No No No No Up No No No No No 1 Yes 0
In [284]:
#distribution of No or Yes in the diabetesMed column
df.groupby('diabetesMed').count()
Out[284]:
race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code MED_SPEC_NUM num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient DIAG_CAT_1 DIAG_CAT_2 DIAG_CAT_3 number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide examide citoglipton insulin glyburide.metformin glipizide.metformin glimepiride.pioglitazone metformin.rosiglitazone metformin.pioglitazone change readmitted
diabetesMed
No 12890 12890 12890 12890 12890 12890 12890 12890 12890 12890 12890 12890 12890 12890 12890 12890 12890 12890 12890 12890 12890 12890 12890 12890 12890 12890 12890 12890 12890 12890 12890 12890 12890 12890 12890 12890 12890 12890 12890 12890 12890 12890 12890 12890 12890 12890 12890
Yes 43110 43110 43110 43110 43110 43110 43110 43110 43110 43110 43110 43110 43110 43110 43110 43110 43110 43110 43110 43110 43110 43110 43110 43110 43110 43110 43110 43110 43110 43110 43110 43110 43110 43110 43110 43110 43110 43110 43110 43110 43110 43110 43110 43110 43110 43110 43110
In [285]:
#replace the values of the 'diabetesMed' column:
# No = 0
# Yes = 1

df = df.replace({'diabetesMed': {'No': 0, 'Yes': 1}})

df.head(2)
Out[285]:
race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code MED_SPEC_NUM num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient DIAG_CAT_1 DIAG_CAT_2 DIAG_CAT_3 number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide examide citoglipton insulin glyburide.metformin glipizide.metformin glimepiride.pioglitazone metformin.rosiglitazone metformin.pioglitazone change diabetesMed readmitted
0 3 0 8 0 3 1 4 5 0 0 39 3 11 0 0 0 10 4 18 7 0 0 No No No No No No No Steady No No No No No No No No No No No No No No No 0 1 1
1 3 0 7 0 5 3 1 6 7 0 79 1 25 3 0 0 16 13 16 9 0 0 Steady No No No No No No Steady No No No No No No No No No Up No No No No No 1 1 0
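The 'change' and 'diabetesMed' columns are both two-valued and are encoded the same way. A minimal sketch, assuming a made-up frame `toy` in place of `df`, of keeping both mappings in one dictionary and applying them in a loop:

```python
import pandas as pd

# Hypothetical toy frame standing in for df.
toy = pd.DataFrame({
    'change': ['No', 'Ch', 'Ch'],
    'diabetesMed': ['Yes', 'No', 'Yes'],
})

# Column name -> label-to-code mapping, matching the two cells above.
binary_codes = {
    'change': {'No': 0, 'Ch': 1},
    'diabetesMed': {'No': 0, 'Yes': 1},
}

# Encode each column with its own mapping.
for column, codes in binary_codes.items():
    toy[column] = toy[column].map(codes)

print(toy)
```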
In [286]:
#distribution of the medical specialty categories in the MED_SPEC_NUM column
df.groupby('MED_SPEC_NUM').count()
Out[286]:
race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient DIAG_CAT_1 DIAG_CAT_2 DIAG_CAT_3 number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide examide citoglipton insulin glyburide.metformin glipizide.metformin glimepiride.pioglitazone metformin.rosiglitazone metformin.pioglitazone change diabetesMed readmitted
MED_SPEC_NUM
0 27562 27562 27562 27562 27562 27562 27562 27562 27562 27562 27562 27562 27562 27562 27562 27562 27562 27562 27562 27562 27562 27562 27562 27562 27562 27562 27562 27562 27562 27562 27562 27562 27562 27562 27562 27562 27562 27562 27562 27562 27562 27562 27562 27562 27562 27562 27562
1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
2 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
3 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12
4 2912 2912 2912 2912 2912 2912 2912 2912 2912 2912 2912 2912 2912 2912 2912 2912 2912 2912 2912 2912 2912 2912 2912 2912 2912 2912 2912 2912 2912 2912 2912 2912 2912 2912 2912 2912 2912 2912 2912 2912 2912 2912 2912 2912 2912 2912 2912
5 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
6 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
7 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
8 4189 4189 4189 4189 4189 4189 4189 4189 4189 4189 4189 4189 4189 4189 4189 4189 4189 4189 4189 4189 4189 4189 4189 4189 4189 4189 4189 4189 4189 4189 4189 4189 4189 4189 4189 4189 4189 4189 4189 4189 4189 4189 4189 4189 4189 4189 4189
9 66 66 66 66 66 66 66 66 66 66 66 66 66 66 66 66 66 66 66 66 66 66 66 66 66 66 66 66 66 66 66 66 66 66 66 66 66 66 66 66 66 66 66 66 66 66 66
10 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
11 4032 4032 4032 4032 4032 4032 4032 4032 4032 4032 4032 4032 4032 4032 4032 4032 4032 4032 4032 4032 4032 4032 4032 4032 4032 4032 4032 4032 4032 4032 4032 4032 4032 4032 4032 4032 4032 4032 4032 4032 4032 4032 4032 4032 4032 4032 4032
12 318 318 318 318 318 318 318 318 318 318 318 318 318 318 318 318 318 318 318 318 318 318 318 318 318 318 318 318 318 318 318 318 318 318 318 318 318 318 318 318 318 318 318 318 318 318 318
13 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37
14 57 57 57 57 57 57 57 57 57 57 57 57 57 57 57 57 57 57 57 57 57 57 57 57 57 57 57 57 57 57 57 57 57 57 57 57 57 57 57 57 57 57 57 57 57 57 57
15 118 118 118 118 118 118 118 118 118 118 118 118 118 118 118 118 118 118 118 118 118 118 118 118 118 118 118 118 118 118 118 118 118 118 118 118 118 118 118 118 118 118 118 118 118 118 118
16 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31
17 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23 23
18 8055 8055 8055 8055 8055 8055 8055 8055 8055 8055 8055 8055 8055 8055 8055 8055 8055 8055 8055 8055 8055 8055 8055 8055 8055 8055 8055 8055 8055 8055 8055 8055 8055 8055 8055 8055 8055 8055 8055 8055 8055 8055 8055 8055 8055 8055 8055
19 914 914 914 914 914 914 914 914 914 914 914 914 914 914 914 914 914 914 914 914 914 914 914 914 914 914 914 914 914 914 914 914 914 914 914 914 914 914 914 914 914 914 914 914 914 914 914
20 122 122 122 122 122 122 122 122 122 122 122 122 122 122 122 122 122 122 122 122 122 122 122 122 122 122 122 122 122 122 122 122 122 122 122 122 122 122 122 122 122 122 122 122 122 122 122
21 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
22 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15
23 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
24 361 361 361 361 361 361 361 361 361 361 361 361 361 361 361 361 361 361 361 361 361 361 361 361 361 361 361 361 361 361 361 361 361 361 361 361 361 361 361 361 361 361 361 361 361 361 361
25 198 198 198 198 198 198 198 198 198 198 198 198 198 198 198 198 198 198 198 198 198 198 198 198 198 198 198 198 198 198 198 198 198 198 198 198 198 198 198 198 198 198 198 198 198 198 198
26 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17
27 764 764 764 764 764 764 764 764 764 764 764 764 764 764 764 764 764 764 764 764 764 764 764 764 764 764 764 764 764 764 764 764 764 764 764 764 764 764 764 764 764 764 764 764 764 764 764
28 651 651 651 651 651 651 651 651 651 651 651 651 651 651 651 651 651 651 651 651 651 651 651 651 651 651 651 651 651 651 651 651 651 651 651 651 651 651 651 651 651 651 651 651 651 651 651
29 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24
30 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64
31 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
32 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12
33 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130 130
34 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
35 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50
36 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
37 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92
38 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
39 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15
40 219 219 219 219 219 219 219 219 219 219 219 219 219 219 219 219 219 219 219 219 219 219 219 219 219 219 219 219 219 219 219 219 219 219 219 219 219 219 219 219 219 219 219 219 219 219 219
41 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
42 56 56 56 56 56 56 56 56 56 56 56 56 56 56 56 56 56 56 56 56 56 56 56 56 56 56 56 56 56 56 56 56 56 56 56 56 56 56 56 56 56 56 56 56 56 56 56
43 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
44 468 468 468 468 468 468 468 468 468 468 468 468 468 468 468 468 468 468 468 468 468 468 468 468 468 468 468 468 468 468 468 468 468 468 468 468 468 468 468 468 468 468 468 468 468 468 468
45 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
46 48 48 48 48 48 48 48 48 48 48 48 48 48 48 48 48 48 48 48 48 48 48 48 48 48 48 48 48 48 48 48 48 48 48 48 48 48 48 48 48 48 48 48 48 48 48 48
47 499 499 499 499 499 499 499 499 499 499 499 499 499 499 499 499 499 499 499 499 499 499 499 499 499 499 499 499 499 499 499 499 499 499 499 499 499 499 499 499 499 499 499 499 499 499 499
48 600 600 600 600 600 600 600 600 600 600 600 600 600 600 600 600 600 600 600 600 600 600 600 600 600 600 600 600 600 600 600 600 600 600 600 600 600 600 600 600 600 600 600 600 600 600 600
49 29 29 29 29 29 29 29 29 29 29 29 29 29 29 29 29 29 29 29 29 29 29 29 29 29 29 29 29 29 29 29 29 29 29 29 29 29 29 29 29 29 29 29 29 29 29 29
50 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
51 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19
52 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49
53 359 359 359 359 359 359 359 359 359 359 359 359 359 359 359 359 359 359 359 359 359 359 359 359 359 359 359 359 359 359 359 359 359 359 359 359 359 359 359 359 359 359 359 359 359 359 359
54 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
55 1703 1703 1703 1703 1703 1703 1703 1703 1703 1703 1703 1703 1703 1703 1703 1703 1703 1703 1703 1703 1703 1703 1703 1703 1703 1703 1703 1703 1703 1703 1703 1703 1703 1703 1703 1703 1703 1703 1703 1703 1703 1703 1703 1703 1703 1703 1703
56 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
57 269 269 269 269 269 269 269 269 269 269 269 269 269 269 269 269 269 269 269 269 269 269 269 269 269 269 269 269 269 269 269 269 269 269 269 269 269 269 269 269 269 269 269 269 269 269 269
58 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
59 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18
60 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55
61 287 287 287 287 287 287 287 287 287 287 287 287 287 287 287 287 287 287 287 287 287 287 287 287 287 287 287 287 287 287 287 287 287 287 287 287 287 287 287 287 287 287 287 287 287 287 287
62 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22
63 372 372 372 372 372 372 372 372 372 372 372 372 372 372 372 372 372 372 372 372 372 372 372 372 372 372 372 372 372 372 372 372 372 372 372 372 372 372 372 372 372 372 372 372 372 372 372
In [287]:
#distribution of diagnosis categories in the DIAG_CAT_1 column
df.groupby('DIAG_CAT_1').count()
Out[287]:
race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code MED_SPEC_NUM num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient DIAG_CAT_2 DIAG_CAT_3 number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide examide citoglipton insulin glyburide.metformin glipizide.metformin glimepiride.pioglitazone metformin.rosiglitazone metformin.pioglitazone change diabetesMed readmitted
DIAG_CAT_1
0 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11
1 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72
2 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900
3 6311 6311 6311 6311 6311 6311 6311 6311 6311 6311 6311 6311 6311 6311 6311 6311 6311 6311 6311 6311 6311 6311 6311 6311 6311 6311 6311 6311 6311 6311 6311 6311 6311 6311 6311 6311 6311 6311 6311 6311 6311 6311 6311 6311 6311 6311 6311
4 624 624 624 624 624 624 624 624 624 624 624 624 624 624 624 624 624 624 624 624 624 624 624 624 624 624 624 624 624 624 624 624 624 624 624 624 624 624 624 624 624 624 624 624 624 624 624
5 1233 1233 1233 1233 1233 1233 1233 1233 1233 1233 1233 1233 1233 1233 1233 1233 1233 1233 1233 1233 1233 1233 1233 1233 1233 1233 1233 1233 1233 1233 1233 1233 1233 1233 1233 1233 1233 1233 1233 1233 1233 1233 1233 1233 1233 1233 1233
6 1632 1632 1632 1632 1632 1632 1632 1632 1632 1632 1632 1632 1632 1632 1632 1632 1632 1632 1632 1632 1632 1632 1632 1632 1632 1632 1632 1632 1632 1632 1632 1632 1632 1632 1632 1632 1632 1632 1632 1632 1632 1632 1632 1632 1632 1632 1632
7 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
8 89 89 89 89 89 89 89 89 89 89 89 89 89 89 89 89 89 89 89 89 89 89 89 89 89 89 89 89 89 89 89 89 89 89 89 89 89 89 89 89 89 89 89 89 89 89 89
9 824 824 824 824 824 824 824 824 824 824 824 824 824 824 824 824 824 824 824 824 824 824 824 824 824 824 824 824 824 824 824 824 824 824 824 824 824 824 824 824 824 824 824 824 824 824 824
10 5774 5774 5774 5774 5774 5774 5774 5774 5774 5774 5774 5774 5774 5774 5774 5774 5774 5774 5774 5774 5774 5774 5774 5774 5774 5774 5774 5774 5774 5774 5774 5774 5774 5774 5774 5774 5774 5774 5774 5774 5774 5774 5774 5774 5774 5774 5774
11 292 292 292 292 292 292 292 292 292 292 292 292 292 292 292 292 292 292 292 292 292 292 292 292 292 292 292 292 292 292 292 292 292 292 292 292 292 292 292 292 292 292 292 292 292 292 292
12 1969 1969 1969 1969 1969 1969 1969 1969 1969 1969 1969 1969 1969 1969 1969 1969 1969 1969 1969 1969 1969 1969 1969 1969 1969 1969 1969 1969 1969 1969 1969 1969 1969 1969 1969 1969 1969 1969 1969 1969 1969 1969 1969 1969 1969 1969 1969
13 3766 3766 3766 3766 3766 3766 3766 3766 3766 3766 3766 3766 3766 3766 3766 3766 3766 3766 3766 3766 3766 3766 3766 3766 3766 3766 3766 3766 3766 3766 3766 3766 3766 3766 3766 3766 3766 3766 3766 3766 3766 3766 3766 3766 3766 3766 3766
14 2457 2457 2457 2457 2457 2457 2457 2457 2457 2457 2457 2457 2457 2457 2457 2457 2457 2457 2457 2457 2457 2457 2457 2457 2457 2457 2457 2457 2457 2457 2457 2457 2457 2457 2457 2457 2457 2457 2457 2457 2457 2457 2457 2457 2457 2457 2457
15 1423 1423 1423 1423 1423 1423 1423 1423 1423 1423 1423 1423 1423 1423 1423 1423 1423 1423 1423 1423 1423 1423 1423 1423 1423 1423 1423 1423 1423 1423 1423 1423 1423 1423 1423 1423 1423 1423 1423 1423 1423 1423 1423 1423 1423 1423 1423
16 5797 5797 5797 5797 5797 5797 5797 5797 5797 5797 5797 5797 5797 5797 5797 5797 5797 5797 5797 5797 5797 5797 5797 5797 5797 5797 5797 5797 5797 5797 5797 5797 5797 5797 5797 5797 5797 5797 5797 5797 5797 5797 5797 5797 5797 5797 5797
17 5124 5124 5124 5124 5124 5124 5124 5124 5124 5124 5124 5124 5124 5124 5124 5124 5124 5124 5124 5124 5124 5124 5124 5124 5124 5124 5124 5124 5124 5124 5124 5124 5124 5124 5124 5124 5124 5124 5124 5124 5124 5124 5124 5124 5124 5124 5124
18 2825 2825 2825 2825 2825 2825 2825 2825 2825 2825 2825 2825 2825 2825 2825 2825 2825 2825 2825 2825 2825 2825 2825 2825 2825 2825 2825 2825 2825 2825 2825 2825 2825 2825 2825 2825 2825 2825 2825 2825 2825 2825 2825 2825 2825 2825 2825
19 367 367 367 367 367 367 367 367 367 367 367 367 367 367 367 367 367 367 367 367 367 367 367 367 367 367 367 367 367 367 367 367 367 367 367 367 367 367 367 367 367 367 367 367 367 367 367
20 1462 1462 1462 1462 1462 1462 1462 1462 1462 1462 1462 1462 1462 1462 1462 1462 1462 1462 1462 1462 1462 1462 1462 1462 1462 1462 1462 1462 1462 1462 1462 1462 1462 1462 1462 1462 1462 1462 1462 1462 1462 1462 1462 1462 1462 1462 1462
21 2677 2677 2677 2677 2677 2677 2677 2677 2677 2677 2677 2677 2677 2677 2677 2677 2677 2677 2677 2677 2677 2677 2677 2677 2677 2677 2677 2677 2677 2677 2677 2677 2677 2677 2677 2677 2677 2677 2677 2677 2677 2677 2677 2677 2677 2677 2677
22 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31
23 4273 4273 4273 4273 4273 4273 4273 4273 4273 4273 4273 4273 4273 4273 4273 4273 4273 4273 4273 4273 4273 4273 4273 4273 4273 4273 4273 4273 4273 4273 4273 4273 4273 4273 4273 4273 4273 4273 4273 4273 4273 4273 4273 4273 4273 4273 4273
24 2184 2184 2184 2184 2184 2184 2184 2184 2184 2184 2184 2184 2184 2184 2184 2184 2184 2184 2184 2184 2184 2184 2184 2184 2184 2184 2184 2184 2184 2184 2184 2184 2184 2184 2184 2184 2184 2184 2184 2184 2184 2184 2184 2184 2184 2184 2184
25 154 154 154 154 154 154 154 154 154 154 154 154 154 154 154 154 154 154 154 154 154 154 154 154 154 154 154 154 154 154 154 154 154 154 154 154 154 154 154 154 154 154 154 154 154 154 154
26 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49
27 1067 1067 1067 1067 1067 1067 1067 1067 1067 1067 1067 1067 1067 1067 1067 1067 1067 1067 1067 1067 1067 1067 1067 1067 1067 1067 1067 1067 1067 1067 1067 1067 1067 1067 1067 1067 1067 1067 1067 1067 1067 1067 1067 1067 1067 1067 1067
28 684 684 684 684 684 684 684 684 684 684 684 684 684 684 684 684 684 684 684 684 684 684 684 684 684 684 684 684 684 684 684 684 684 684 684 684 684 684 684 684 684 684 684 684 684 684 684
31 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
32 926 926 926 926 926 926 926 926 926 926 926 926 926 926 926 926 926 926 926 926 926 926 926 926 926 926 926 926 926 926 926 926 926 926 926 926 926 926 926 926 926 926 926 926 926 926 926
In [288]:
#distribution of diagnosis categories in the DIAG_CAT_2 column
df.groupby('DIAG_CAT_2').count()
Out[288]:
race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code MED_SPEC_NUM num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient DIAG_CAT_1 DIAG_CAT_3 number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide examide citoglipton insulin glyburide.metformin glipizide.metformin glimepiride.pioglitazone metformin.rosiglitazone metformin.pioglitazone change diabetesMed readmitted
DIAG_CAT_2
0 180 180 180 180 180 180 180 180 180 180 180 180 180 180 180 180 180 180 180 180 180 180 180 180 180 180 180 180 180 180 180 180 180 180 180 180 180 180 180 180 180 180 180 180 180 180 180
1 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193 193
2 1392 1392 1392 1392 1392 1392 1392 1392 1392 1392 1392 1392 1392 1392 1392 1392 1392 1392 1392 1392 1392 1392 1392 1392 1392 1392 1392 1392 1392 1392 1392 1392 1392 1392 1392 1392 1392 1392 1392 1392 1392 1392 1392 1392 1392 1392 1392
3 11530 11530 11530 11530 11530 11530 11530 11530 11530 11530 11530 11530 11530 11530 11530 11530 11530 11530 11530 11530 11530 11530 11530 11530 11530 11530 11530 11530 11530 11530 11530 11530 11530 11530 11530 11530 11530 11530 11530 11530 11530 11530 11530 11530 11530 11530 11530
4 1649 1649 1649 1649 1649 1649 1649 1649 1649 1649 1649 1649 1649 1649 1649 1649 1649 1649 1649 1649 1649 1649 1649 1649 1649 1649 1649 1649 1649 1649 1649 1649 1649 1649 1649 1649 1649 1649 1649 1649 1649 1649 1649 1649 1649 1649 1649
5 1436 1436 1436 1436 1436 1436 1436 1436 1436 1436 1436 1436 1436 1436 1436 1436 1436 1436 1436 1436 1436 1436 1436 1436 1436 1436 1436 1436 1436 1436 1436 1436 1436 1436 1436 1436 1436 1436 1436 1436 1436 1436 1436 1436 1436 1436 1436
6 964 964 964 964 964 964 964 964 964 964 964 964 964 964 964 964 964 964 964 964 964 964 964 964 964 964 964 964 964 964 964 964 964 964 964 964 964 964 964 964 964 964 964 964 964 964 964
8 188 188 188 188 188 188 188 188 188 188 188 188 188 188 188 188 188 188 188 188 188 188 188 188 188 188 188 188 188 188 188 188 188 188 188 188 188 188 188 188 188 188 188 188 188 188 188
9 3946 3946 3946 3946 3946 3946 3946 3946 3946 3946 3946 3946 3946 3946 3946 3946 3946 3946 3946 3946 3946 3946 3946 3946 3946 3946 3946 3946 3946 3946 3946 3946 3946 3946 3946 3946 3946 3946 3946 3946 3946 3946 3946 3946 3946 3946 3946
10 3984 3984 3984 3984 3984 3984 3984 3984 3984 3984 3984 3984 3984 3984 3984 3984 3984 3984 3984 3984 3984 3984 3984 3984 3984 3984 3984 3984 3984 3984 3984 3984 3984 3984 3984 3984 3984 3984 3984 3984 3984 3984 3984 3984 3984 3984 3984
11 113 113 113 113 113 113 113 113 113 113 113 113 113 113 113 113 113 113 113 113 113 113 113 113 113 113 113 113 113 113 113 113 113 113 113 113 113 113 113 113 113 113 113 113 113 113 113
12 4402 4402 4402 4402 4402 4402 4402 4402 4402 4402 4402 4402 4402 4402 4402 4402 4402 4402 4402 4402 4402 4402 4402 4402 4402 4402 4402 4402 4402 4402 4402 4402 4402 4402 4402 4402 4402 4402 4402 4402 4402 4402 4402 4402 4402 4402 4402
13 3719 3719 3719 3719 3719 3719 3719 3719 3719 3719 3719 3719 3719 3719 3719 3719 3719 3719 3719 3719 3719 3719 3719 3719 3719 3719 3719 3719 3719 3719 3719 3719 3719 3719 3719 3719 3719 3719 3719 3719 3719 3719 3719 3719 3719 3719 3719
14 456 456 456 456 456 456 456 456 456 456 456 456 456 456 456 456 456 456 456 456 456 456 456 456 456 456 456 456 456 456 456 456 456 456 456 456 456 456 456 456 456 456 456 456 456 456 456
15 781 781 781 781 781 781 781 781 781 781 781 781 781 781 781 781 781 781 781 781 781 781 781 781 781 781 781 781 781 781 781 781 781 781 781 781 781 781 781 781 781 781 781 781 781 781 781
16 5578 5578 5578 5578 5578 5578 5578 5578 5578 5578 5578 5578 5578 5578 5578 5578 5578 5578 5578 5578 5578 5578 5578 5578 5578 5578 5578 5578 5578 5578 5578 5578 5578 5578 5578 5578 5578 5578 5578 5578 5578 5578 5578 5578 5578 5578 5578
17 2198 2198 2198 2198 2198 2198 2198 2198 2198 2198 2198 2198 2198 2198 2198 2198 2198 2198 2198 2198 2198 2198 2198 2198 2198 2198 2198 2198 2198 2198 2198 2198 2198 2198 2198 2198 2198 2198 2198 2198 2198 2198 2198 2198 2198 2198 2198
18 4365 4365 4365 4365 4365 4365 4365 4365 4365 4365 4365 4365 4365 4365 4365 4365 4365 4365 4365 4365 4365 4365 4365 4365 4365 4365 4365 4365 4365 4365 4365 4365 4365 4365 4365 4365 4365 4365 4365 4365 4365 4365 4365 4365 4365 4365 4365
19 234 234 234 234 234 234 234 234 234 234 234 234 234 234 234 234 234 234 234 234 234 234 234 234 234 234 234 234 234 234 234 234 234 234 234 234 234 234 234 234 234 234 234 234 234 234 234
20 2113 2113 2113 2113 2113 2113 2113 2113 2113 2113 2113 2113 2113 2113 2113 2113 2113 2113 2113 2113 2113 2113 2113 2113 2113 2113 2113 2113 2113 2113 2113 2113 2113 2113 2113 2113 2113 2113 2113 2113 2113 2113 2113 2113 2113 2113 2113
21 972 972 972 972 972 972 972 972 972 972 972 972 972 972 972 972 972 972 972 972 972 972 972 972 972 972 972 972 972 972 972 972 972 972 972 972 972 972 972 972 972 972 972 972 972 972 972
22 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69
23 2635 2635 2635 2635 2635 2635 2635 2635 2635 2635 2635 2635 2635 2635 2635 2635 2635 2635 2635 2635 2635 2635 2635 2635 2635 2635 2635 2635 2635 2635 2635 2635 2635 2635 2635 2635 2635 2635 2635 2635 2635 2635 2635 2635 2635 2635 2635
24 595 595 595 595 595 595 595 595 595 595 595 595 595 595 595 595 595 595 595 595 595 595 595 595 595 595 595 595 595 595 595 595 595 595 595 595 595 595 595 595 595 595 595 595 595 595 595
25 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11
26 77 77 77 77 77 77 77 77 77 77 77 77 77 77 77 77 77 77 77 77 77 77 77 77 77 77 77 77 77 77 77 77 77 77 77 77 77 77 77 77 77 77 77 77 77 77 77
27 269 269 269 269 269 269 269 269 269 269 269 269 269 269 269 269 269 269 269 269 269 269 269 269 269 269 269 269 269 269 269 269 269 269 269 269 269 269 269 269 269 269 269 269 269 269 269
28 540 540 540 540 540 540 540 540 540 540 540 540 540 540 540 540 540 540 540 540 540 540 540 540 540 540 540 540 540 540 540 540 540 540 540 540 540 540 540 540 540 540 540 540 540 540 540
31 396 396 396 396 396 396 396 396 396 396 396 396 396 396 396 396 396 396 396 396 396 396 396 396 396 396 396 396 396 396 396 396 396 396 396 396 396 396 396 396 396 396 396 396 396 396 396
32 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015
In [289]:
#distribution of diagnosis categories in the DIAG_CAT_3 column
df.groupby('DIAG_CAT_3').count()
Out[289]:
race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code MED_SPEC_NUM num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient DIAG_CAT_1 DIAG_CAT_2 number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide examide citoglipton insulin glyburide.metformin glipizide.metformin glimepiride.pioglitazone metformin.rosiglitazone metformin.pioglitazone change diabetesMed readmitted
DIAG_CAT_3
0 768 768 768 768 768 768 768 768 768 768 768 768 768 768 768 768 768 768 768 768 768 768 768 768 768 768 768 768 768 768 768 768 768 768 768 768 768 768 768 768 768 768 768 768 768 768 768
1 196 196 196 196 196 196 196 196 196 196 196 196 196 196 196 196 196 196 196 196 196 196 196 196 196 196 196 196 196 196 196 196 196 196 196 196 196 196 196 196 196 196 196 196 196 196 196
2 1003 1003 1003 1003 1003 1003 1003 1003 1003 1003 1003 1003 1003 1003 1003 1003 1003 1003 1003 1003 1003 1003 1003 1003 1003 1003 1003 1003 1003 1003 1003 1003 1003 1003 1003 1003 1003 1003 1003 1003 1003 1003 1003 1003 1003 1003 1003
3 14633 14633 14633 14633 14633 14633 14633 14633 14633 14633 14633 14633 14633 14633 14633 14633 14633 14633 14633 14633 14633 14633 14633 14633 14633 14633 14633 14633 14633 14633 14633 14633 14633 14633 14633 14633 14633 14633 14633 14633 14633 14633 14633 14633 14633 14633 14633
4 1369 1369 1369 1369 1369 1369 1369 1369 1369 1369 1369 1369 1369 1369 1369 1369 1369 1369 1369 1369 1369 1369 1369 1369 1369 1369 1369 1369 1369 1369 1369 1369 1369 1369 1369 1369 1369 1369 1369 1369 1369 1369 1369 1369 1369 1369 1369
5 1744 1744 1744 1744 1744 1744 1744 1744 1744 1744 1744 1744 1744 1744 1744 1744 1744 1744 1744 1744 1744 1744 1744 1744 1744 1744 1744 1744 1744 1744 1744 1744 1744 1744 1744 1744 1744 1744 1744 1744 1744 1744 1744 1744 1744 1744 1744
6 1106 1106 1106 1106 1106 1106 1106 1106 1106 1106 1106 1106 1106 1106 1106 1106 1106 1106 1106 1106 1106 1106 1106 1106 1106 1106 1106 1106 1106 1106 1106 1106 1106 1106 1106 1106 1106 1106 1106 1106 1106 1106 1106 1106 1106 1106 1106
8 231 231 231 231 231 231 231 231 231 231 231 231 231 231 231 231 231 231 231 231 231 231 231 231 231 231 231 231 231 231 231 231 231 231 231 231 231 231 231 231 231 231 231 231 231 231 231
9 6087 6087 6087 6087 6087 6087 6087 6087 6087 6087 6087 6087 6087 6087 6087 6087 6087 6087 6087 6087 6087 6087 6087 6087 6087 6087 6087 6087 6087 6087 6087 6087 6087 6087 6087 6087 6087 6087 6087 6087 6087 6087 6087 6087 6087 6087 6087
10 3161 3161 3161 3161 3161 3161 3161 3161 3161 3161 3161 3161 3161 3161 3161 3161 3161 3161 3161 3161 3161 3161 3161 3161 3161 3161 3161 3161 3161 3161 3161 3161 3161 3161 3161 3161 3161 3161 3161 3161 3161 3161 3161 3161 3161 3161 3161
11 114 114 114 114 114 114 114 114 114 114 114 114 114 114 114 114 114 114 114 114 114 114 114 114 114 114 114 114 114 114 114 114 114 114 114 114 114 114 114 114 114 114 114 114 114 114 114
12 3547 3547 3547 3547 3547 3547 3547 3547 3547 3547 3547 3547 3547 3547 3547 3547 3547 3547 3547 3547 3547 3547 3547 3547 3547 3547 3547 3547 3547 3547 3547 3547 3547 3547 3547 3547 3547 3547 3547 3547 3547 3547 3547 3547 3547 3547 3547
13 2526 2526 2526 2526 2526 2526 2526 2526 2526 2526 2526 2526 2526 2526 2526 2526 2526 2526 2526 2526 2526 2526 2526 2526 2526 2526 2526 2526 2526 2526 2526 2526 2526 2526 2526 2526 2526 2526 2526 2526 2526 2526 2526 2526 2526 2526 2526
14 374 374 374 374 374 374 374 374 374 374 374 374 374 374 374 374 374 374 374 374 374 374 374 374 374 374 374 374 374 374 374 374 374 374 374 374 374 374 374 374 374 374 374 374 374 374 374
15 741 741 741 741 741 741 741 741 741 741 741 741 741 741 741 741 741 741 741 741 741 741 741 741 741 741 741 741 741 741 741 741 741 741 741 741 741 741 741 741 741 741 741 741 741 741 741
16 3700 3700 3700 3700 3700 3700 3700 3700 3700 3700 3700 3700 3700 3700 3700 3700 3700 3700 3700 3700 3700 3700 3700 3700 3700 3700 3700 3700 3700 3700 3700 3700 3700 3700 3700 3700 3700 3700 3700 3700 3700 3700 3700 3700 3700 3700 3700
17 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016
18 3532 3532 3532 3532 3532 3532 3532 3532 3532 3532 3532 3532 3532 3532 3532 3532 3532 3532 3532 3532 3532 3532 3532 3532 3532 3532 3532 3532 3532 3532 3532 3532 3532 3532 3532 3532 3532 3532 3532 3532 3532 3532 3532 3532 3532 3532 3532
19 150 150 150 150 150 150 150 150 150 150 150 150 150 150 150 150 150 150 150 150 150 150 150 150 150 150 150 150 150 150 150 150 150 150 150 150 150 150 150 150 150 150 150 150 150 150 150
20 1491 1491 1491 1491 1491 1491 1491 1491 1491 1491 1491 1491 1491 1491 1491 1491 1491 1491 1491 1491 1491 1491 1491 1491 1491 1491 1491 1491 1491 1491 1491 1491 1491 1491 1491 1491 1491 1491 1491 1491 1491 1491 1491 1491 1491 1491 1491
21 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015 1015
22 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49 49
23 2502 2502 2502 2502 2502 2502 2502 2502 2502 2502 2502 2502 2502 2502 2502 2502 2502 2502 2502 2502 2502 2502 2502 2502 2502 2502 2502 2502 2502 2502 2502 2502 2502 2502 2502 2502 2502 2502 2502 2502 2502 2502 2502 2502 2502 2502 2502
24 511 511 511 511 511 511 511 511 511 511 511 511 511 511 511 511 511 511 511 511 511 511 511 511 511 511 511 511 511 511 511 511 511 511 511 511 511 511 511 511 511 511 511 511 511 511 511
25 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24
26 159 159 159 159 159 159 159 159 159 159 159 159 159 159 159 159 159 159 159 159 159 159 159 159 159 159 159 159 159 159 159 159 159 159 159 159 159 159 159 159 159 159 159 159 159 159 159
27 144 144 144 144 144 144 144 144 144 144 144 144 144 144 144 144 144 144 144 144 144 144 144 144 144 144 144 144 144 144 144 144 144 144 144 144 144 144 144 144 144 144 144 144 144 144 144
28 361 361 361 361 361 361 361 361 361 361 361 361 361 361 361 361 361 361 361 361 361 361 361 361 361 361 361 361 361 361 361 361 361 361 361 361 361 361 361 361 361 361 361 361 361 361 361
31 689 689 689 689 689 689 689 689 689 689 689 689 689 689 689 689 689 689 689 689 689 689 689 689 689 689 689 689 689 689 689 689 689 689 689 689 689 689 689 689 689 689 689 689 689 689 689
32 2057 2057 2057 2057 2057 2057 2057 2057 2057 2057 2057 2057 2057 2057 2057 2057 2057 2057 2057 2057 2057 2057 2057 2057 2057 2057 2057 2057 2057 2057 2057 2057 2057 2057 2057 2057 2057 2057 2057 2057 2057 2057 2057 2057 2057 2057 2057
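Because `groupby(...).count()` counts the non-null values in every remaining column, each row of the tables above repeats a single number across all columns. A more compact way to get the same distribution is `value_counts()` on the one column of interest. The sketch below uses a small made-up frame (`toy` and its values are hypothetical, not the patient data):

```python
import pandas as pd

# hypothetical toy data standing in for the patient frame
toy = pd.DataFrame({
    'DIAG_CAT_3': [3, 3, 9, 3, 12, 9],
    'readmitted': [0, 1, 0, 2, 0, 1],
})

# one count per diagnosis category, ordered by category code
diag_counts = toy['DIAG_CAT_3'].value_counts().sort_index()
print(diag_counts)
```

The result is a single Series indexed by category code, which is easier to scan or plot than the full-width table.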
In [290]:
# replace the values in the medicine columns:
# No = 0
# Down = 1
# Steady = 2
# Up = 3

med_cols = ['metformin', 'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride',
            'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide', 'pioglitazone',
            'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone', 'tolazamide',
            'examide', 'citoglipton', 'insulin', 'glyburide.metformin',
            'glipizide.metformin', 'glimepiride.pioglitazone',
            'metformin.rosiglitazone', 'metformin.pioglitazone']
dose_map = {'No': 0, 'Down': 1, 'Steady': 2, 'Up': 3}

# apply the same mapping to every medicine column in a single replace call
df = df.replace({col: dose_map for col in med_cols})

df.head(2)
Out[290]:
race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code MED_SPEC_NUM num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient DIAG_CAT_1 DIAG_CAT_2 DIAG_CAT_3 number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide examide citoglipton insulin glyburide.metformin glipizide.metformin glimepiride.pioglitazone metformin.rosiglitazone metformin.pioglitazone change diabetesMed readmitted
0 3 0 8 0 3 1 4 5 0 0 39 3 11 0 0 0 10 4 18 7 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1
1 3 0 7 0 5 3 1 6 7 0 79 1 25 3 0 0 16 13 16 9 0 0 2 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 1 1 0
In [291]:
# check to make sure all factors are now int
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 56000 entries, 0 to 55999
Data columns (total 48 columns):
race                        56000 non-null int64
gender                      56000 non-null int64
age                         56000 non-null int64
weight                      56000 non-null int64
admission_type_id           56000 non-null int64
discharge_disposition_id    56000 non-null int64
admission_source_id         56000 non-null int64
time_in_hospital            56000 non-null int64
payer_code                  56000 non-null int64
MED_SPEC_NUM                56000 non-null int64
num_lab_procedures          56000 non-null int64
num_procedures              56000 non-null int64
num_medications             56000 non-null int64
number_outpatient           56000 non-null int64
number_emergency            56000 non-null int64
number_inpatient            56000 non-null int64
DIAG_CAT_1                  56000 non-null int64
DIAG_CAT_2                  56000 non-null int64
DIAG_CAT_3                  56000 non-null int64
number_diagnoses            56000 non-null int64
max_glu_serum               56000 non-null int64
A1Cresult                   56000 non-null int64
metformin                   56000 non-null int64
repaglinide                 56000 non-null int64
nateglinide                 56000 non-null int64
chlorpropamide              56000 non-null int64
glimepiride                 56000 non-null int64
acetohexamide               56000 non-null int64
glipizide                   56000 non-null int64
glyburide                   56000 non-null int64
tolbutamide                 56000 non-null int64
pioglitazone                56000 non-null int64
rosiglitazone               56000 non-null int64
acarbose                    56000 non-null int64
miglitol                    56000 non-null int64
troglitazone                56000 non-null int64
tolazamide                  56000 non-null int64
examide                     56000 non-null int64
citoglipton                 56000 non-null int64
insulin                     56000 non-null int64
glyburide.metformin         56000 non-null int64
glipizide.metformin         56000 non-null int64
glimepiride.pioglitazone    56000 non-null int64
metformin.rosiglitazone     56000 non-null int64
metformin.pioglitazone      56000 non-null int64
change                      56000 non-null int64
diabetesMed                 56000 non-null int64
readmitted                  56000 non-null int64
dtypes: int64(48)
memory usage: 20.5 MB
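Reading 48 lines of `info()` output by eye is one way to confirm the conversion. The same check can also be done programmatically, as in this minimal sketch on a made-up frame (`toy` is hypothetical; on the real data one would pass `df` instead):

```python
import pandas as pd
from pandas.api.types import is_integer_dtype

# hypothetical toy frame: one column still holds strings
toy = pd.DataFrame({
    'insulin': [0, 2, 3],
    'metformin': ['No', 'Steady', 0],
})

# names of columns whose dtype is not an integer type
non_int_cols = [col for col in toy.columns if not is_integer_dtype(toy[col])]
print(non_int_cols)
```

An empty list means every column was converted. Any name that appears points to a column that still needs cleaning.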
In [292]:
# keep the converted, all-integer data frame under a new name
# (this binds a second name to the same object; the CSV file is written in a later cell)
df_clean_NoString = df
In [293]:
# check to make sure all factors of the new data frame are int
df_clean_NoString.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 56000 entries, 0 to 55999
Data columns (total 48 columns):
race                        56000 non-null int64
gender                      56000 non-null int64
age                         56000 non-null int64
weight                      56000 non-null int64
admission_type_id           56000 non-null int64
discharge_disposition_id    56000 non-null int64
admission_source_id         56000 non-null int64
time_in_hospital            56000 non-null int64
payer_code                  56000 non-null int64
MED_SPEC_NUM                56000 non-null int64
num_lab_procedures          56000 non-null int64
num_procedures              56000 non-null int64
num_medications             56000 non-null int64
number_outpatient           56000 non-null int64
number_emergency            56000 non-null int64
number_inpatient            56000 non-null int64
DIAG_CAT_1                  56000 non-null int64
DIAG_CAT_2                  56000 non-null int64
DIAG_CAT_3                  56000 non-null int64
number_diagnoses            56000 non-null int64
max_glu_serum               56000 non-null int64
A1Cresult                   56000 non-null int64
metformin                   56000 non-null int64
repaglinide                 56000 non-null int64
nateglinide                 56000 non-null int64
chlorpropamide              56000 non-null int64
glimepiride                 56000 non-null int64
acetohexamide               56000 non-null int64
glipizide                   56000 non-null int64
glyburide                   56000 non-null int64
tolbutamide                 56000 non-null int64
pioglitazone                56000 non-null int64
rosiglitazone               56000 non-null int64
acarbose                    56000 non-null int64
miglitol                    56000 non-null int64
troglitazone                56000 non-null int64
tolazamide                  56000 non-null int64
examide                     56000 non-null int64
citoglipton                 56000 non-null int64
insulin                     56000 non-null int64
glyburide.metformin         56000 non-null int64
glipizide.metformin         56000 non-null int64
glimepiride.pioglitazone    56000 non-null int64
metformin.rosiglitazone     56000 non-null int64
metformin.pioglitazone      56000 non-null int64
change                      56000 non-null int64
diabetesMed                 56000 non-null int64
readmitted                  56000 non-null int64
dtypes: int64(48)
memory usage: 20.5 MB
In [294]:
# write dataframe with no string values to new csv file
df_clean_NoString.to_csv('data/Challenge_1_Training_Work_Clean_NoString.csv')
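By default `to_csv` also writes the row index, which is why the file shows an extra `Unnamed: 0` column when it is read back in the exploratory section. The sketch below shows the difference on a made-up frame, using an in-memory buffer instead of a file (`toy` is hypothetical):

```python
import io
import pandas as pd

toy = pd.DataFrame({'race': [3, 5], 'readmitted': [1, 0]})

# default: the row index is written as an unnamed first column
with_index = pd.read_csv(io.StringIO(toy.to_csv()))

# index=False: only the data columns are written
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```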
In [386]:
# input exploratory analysis slide here
Image('Images/christensen_finalprj/Slide9.png')
Out[386]:

Exploratory data analysis

In [296]:
#import ETL patient data
df = pd.read_csv('data/Challenge_1_Training_Work_Clean_NoString.csv')
df.head(5)
Out[296]:
Unnamed: 0 race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code MED_SPEC_NUM num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient DIAG_CAT_1 DIAG_CAT_2 DIAG_CAT_3 number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide examide citoglipton insulin glyburide.metformin glipizide.metformin glimepiride.pioglitazone metformin.rosiglitazone metformin.pioglitazone change diabetesMed readmitted
0 0 3 0 8 0 3 1 4 5 0 0 39 3 11 0 0 0 10 4 18 7 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1
1 1 3 0 7 0 5 3 1 6 7 0 79 1 25 3 0 0 16 13 16 9 0 0 2 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 1 1 0
2 2 5 0 6 0 1 22 7 4 14 18 29 2 18 0 0 1 24 18 2 9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 1 0
3 3 3 1 7 0 1 1 7 3 0 18 72 3 18 0 0 0 17 4 3 9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 1 1
4 4 3 0 3 0 2 1 1 3 0 0 21 1 6 0 0 0 23 18 32 9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
In [297]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 56000 entries, 0 to 55999
Data columns (total 49 columns):
Unnamed: 0                  56000 non-null int64
race                        56000 non-null int64
gender                      56000 non-null int64
age                         56000 non-null int64
weight                      56000 non-null int64
admission_type_id           56000 non-null int64
discharge_disposition_id    56000 non-null int64
admission_source_id         56000 non-null int64
time_in_hospital            56000 non-null int64
payer_code                  56000 non-null int64
MED_SPEC_NUM                56000 non-null int64
num_lab_procedures          56000 non-null int64
num_procedures              56000 non-null int64
num_medications             56000 non-null int64
number_outpatient           56000 non-null int64
number_emergency            56000 non-null int64
number_inpatient            56000 non-null int64
DIAG_CAT_1                  56000 non-null int64
DIAG_CAT_2                  56000 non-null int64
DIAG_CAT_3                  56000 non-null int64
number_diagnoses            56000 non-null int64
max_glu_serum               56000 non-null int64
A1Cresult                   56000 non-null int64
metformin                   56000 non-null int64
repaglinide                 56000 non-null int64
nateglinide                 56000 non-null int64
chlorpropamide              56000 non-null int64
glimepiride                 56000 non-null int64
acetohexamide               56000 non-null int64
glipizide                   56000 non-null int64
glyburide                   56000 non-null int64
tolbutamide                 56000 non-null int64
pioglitazone                56000 non-null int64
rosiglitazone               56000 non-null int64
acarbose                    56000 non-null int64
miglitol                    56000 non-null int64
troglitazone                56000 non-null int64
tolazamide                  56000 non-null int64
examide                     56000 non-null int64
citoglipton                 56000 non-null int64
insulin                     56000 non-null int64
glyburide.metformin         56000 non-null int64
glipizide.metformin         56000 non-null int64
glimepiride.pioglitazone    56000 non-null int64
metformin.rosiglitazone     56000 non-null int64
metformin.pioglitazone      56000 non-null int64
change                      56000 non-null int64
diabetesMed                 56000 non-null int64
readmitted                  56000 non-null int64
dtypes: int64(49)
memory usage: 20.9 MB
In [298]:
# drop the column 'Unnamed: 0' since it is not used in the analysis, then display the result
df = df.drop('Unnamed: 0', axis=1)
df.head(2)
Out[298]:
race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code MED_SPEC_NUM num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient DIAG_CAT_1 DIAG_CAT_2 DIAG_CAT_3 number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide examide citoglipton insulin glyburide.metformin glipizide.metformin glimepiride.pioglitazone metformin.rosiglitazone metformin.pioglitazone change diabetesMed readmitted
0 3 0 8 0 3 1 4 5 0 0 39 3 11 0 0 0 10 4 18 7 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1
1 3 0 7 0 5 3 1 6 7 0 79 1 25 3 0 0 16 13 16 9 0 0 2 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 1 1 0
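When a CSV file already contains a saved index column, an alternative to dropping it after loading is to tell `read_csv` to treat the first column as the index. A minimal sketch, assuming a small hypothetical CSV text in place of the real file:

```python
import io
import pandas as pd

# hypothetical CSV text written with the default index column
csv_text = ",race,readmitted\n0,3,1\n1,5,0\n"

# index_col=0 turns the first column into the row index instead of data
toy = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(toy.columns))
```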
In [299]:
# basic statistics
df.describe()
Out[299]:
race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code MED_SPEC_NUM num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient DIAG_CAT_1 DIAG_CAT_2 DIAG_CAT_3 number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide examide citoglipton insulin glyburide.metformin glipizide.metformin glimepiride.pioglitazone metformin.rosiglitazone metformin.pioglitazone change diabetesMed readmitted
count 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.0 56000.0 56000.000000 56000.000000 56000.000000 56000.0 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000
mean 2.602071 0.464464 6.096589 0.123946 2.016893 3.721821 5.756643 4.398161 4.369375 10.668643 43.141661 1.335893 16.009268 0.367321 0.196875 0.637054 14.213321 12.011054 11.357411 7.423750 0.092089 0.368375 0.398875 0.029911 0.014089 0.001732 0.102536 0.000036 0.254732 0.210071 0.000357 0.146161 0.125821 0.006214 0.000857 0.000107 0.000714 0.0 0.0 1.058839 0.013214 0.000321 0.0 0.000036 0.000036 0.462679 0.769821 0.572268
std 0.937754 0.498740 1.590761 0.712004 1.438340 5.291517 4.053838 2.984346 4.363828 15.595799 19.656507 1.702009 8.132455 1.249570 0.916820 1.270768 7.272908 7.443902 8.157131 1.931488 0.431655 0.890972 0.815169 0.247161 0.169132 0.060480 0.449274 0.008452 0.678992 0.627625 0.026724 0.525985 0.490002 0.112904 0.042249 0.014638 0.037790 0.0 0.0 1.102484 0.162472 0.025353 0.0 0.008452 0.008452 0.498610 0.420951 0.685018
min 0.000000 0.000000 0.000000 0.000000 1.000000 1.000000 1.000000 1.000000 0.000000 0.000000 1.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.000000 0.000000 0.000000
25% 3.000000 0.000000 5.000000 0.000000 1.000000 1.000000 1.000000 2.000000 0.000000 0.000000 32.000000 0.000000 10.000000 0.000000 0.000000 0.000000 10.000000 4.000000 3.000000 6.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.000000 1.000000 0.000000
50% 3.000000 0.000000 6.000000 0.000000 1.000000 1.000000 7.000000 4.000000 6.000000 4.000000 44.000000 1.000000 15.000000 0.000000 0.000000 0.000000 15.000000 12.000000 10.000000 8.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 1.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.000000 1.000000 0.000000
75% 3.000000 1.000000 7.000000 0.000000 3.000000 4.000000 7.000000 6.000000 7.000000 18.000000 57.000000 2.000000 20.000000 0.000000 0.000000 1.000000 18.000000 17.000000 17.000000 9.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 2.000000 0.000000 0.000000 0.0 0.000000 0.000000 1.000000 1.000000 1.000000
max 5.000000 1.000000 9.000000 9.000000 8.000000 28.000000 25.000000 14.000000 16.000000 63.000000 132.000000 6.000000 75.000000 42.000000 76.000000 21.000000 32.000000 32.000000 32.000000 16.000000 3.000000 3.000000 3.000000 3.000000 3.000000 3.000000 3.000000 2.000000 3.000000 3.000000 2.000000 3.000000 3.000000 3.000000 3.000000 2.000000 2.000000 0.0 0.0 3.000000 3.000000 2.000000 0.0 2.000000 2.000000 1.000000 1.000000 2.000000

Basic Statistics Notes

  • Mean patient profile: Caucasian, female, in their 60s, urgent admission, discharged/transferred, transferred from another facility, 4.4 days in hospital, CP payer, 43 lab procedures, 16 medications,
  • 0.4 outpatient visits in the previous year, 0.2 ER visits, 0.64 inpatient visits, 7.4 diagnoses, max_glu_serum 0.09, A1Cresult 0.4
  • Several medications show little or no use, so some medication columns need to be eliminated
  • 'readmitted' needs a dummy variable that combines No (0) and >30 (1), since the question is only whether the patient was readmitted in <30 days
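
The recoding of 'readmitted' described in the last note could look roughly like the sketch below. It runs on a small toy frame rather than the notebook's df. The coding 0 = No, 1 = >30 days, 2 = <30 days is an assumption inferred from the note and from the column's maximum of 2, and the column name readmitted_lt30 is hypothetical.

```python
import pandas as pd

# Toy stand-in for the notebook's df (values are illustrative only).
# Assumed coding: 0 = No, 1 = readmitted after >30 days, 2 = readmitted in <30 days.
toy = pd.DataFrame({"readmitted": [0, 1, 2, 0, 2, 1]})

# Binary target: 1 only for a readmission within 30 days;
# "No" (0) and ">30" (1) both collapse to 0.
toy["readmitted_lt30"] = (toy["readmitted"] == 2).astype(int)

print(toy)
```

If the actual coding of the three classes differs, only the comparison value in the mask needs to change.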
In [300]:
# correlation analysis
df.corr()
Out[300]:
race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code MED_SPEC_NUM num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient DIAG_CAT_1 DIAG_CAT_2 DIAG_CAT_3 number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide examide citoglipton insulin glyburide.metformin glipizide.metformin glimepiride.pioglitazone metformin.rosiglitazone metformin.pioglitazone change diabetesMed readmitted
race 1.000000 0.061706 0.114255 0.040520 0.096587 0.005805 0.033113 -0.020364 0.041640 -0.030777 -0.023193 0.024391 0.022157 0.050845 -0.012812 -0.006053 0.042924 0.029594 0.016000 0.081672 0.054576 -0.013318 0.010548 0.025466 -0.004170 0.006801 0.008261 0.001793 0.018551 0.015784 -0.001455 0.026105 0.005938 0.013237 -0.001307 0.003106 0.003990 NaN NaN -0.039862 0.006384 0.005380 NaN -0.011726 0.001793 0.008300 -0.004537 0.014912
gender 0.061706 1.000000 -0.048579 0.014491 0.014578 -0.019566 -0.005222 -0.031088 0.000833 0.016623 -0.004968 0.061668 -0.023819 -0.005846 -0.024202 -0.013405 -0.034311 0.008083 0.008343 -0.007818 -0.001347 0.016539 0.001549 -0.004777 -0.005390 0.006481 -0.000156 -0.003935 0.026810 0.034631 -0.001727 0.002339 0.010843 0.010581 0.009920 0.007860 0.003242 NaN NaN 0.000247 0.002489 0.007965 NaN 0.004538 -0.003935 0.012476 0.015391 -0.013626
age 0.114255 -0.048579 1.000000 0.005716 -0.005747 0.113970 0.041070 0.107273 0.058032 -0.068202 0.025665 -0.028360 0.039010 0.029064 -0.089149 -0.047012 0.091837 0.077541 0.052021 0.243515 0.018618 -0.147559 -0.060696 0.045565 0.020363 0.012367 0.044360 0.002400 0.055867 0.076798 0.010110 0.013860 0.003034 0.008092 0.011788 -0.001978 0.003605 NaN NaN -0.079078 -0.002451 0.003658 NaN 0.002400 -0.000257 -0.037793 -0.025360 0.029704
weight 0.040520 0.014491 0.005716 1.000000 0.037503 -0.035383 0.003026 0.023652 0.047819 0.004630 0.090456 0.018693 0.011274 0.104440 0.003706 -0.009154 0.023982 0.031824 0.014000 0.054391 -0.037139 -0.021109 0.007304 -0.005440 0.010707 -0.000839 0.013694 -0.000736 0.017062 0.008707 -0.002326 0.026059 0.004232 0.010411 -0.003532 -0.001274 0.000692 NaN NaN -0.076697 -0.014159 -0.002207 NaN -0.000736 -0.000736 -0.041219 -0.030585 0.027236
admission_type_id 0.096587 0.014578 -0.005747 0.037503 1.000000 0.085986 0.098007 -0.014285 -0.136863 0.185351 -0.145869 0.131923 0.075711 0.030746 -0.018190 -0.032648 0.032151 -0.005648 -0.008918 -0.113991 0.352793 -0.043929 0.008631 -0.003481 -0.008099 0.007875 -0.003178 -0.002988 0.007991 -0.002804 0.006347 0.018570 0.022930 0.006061 -0.001414 0.003307 0.010291 NaN NaN -0.025368 -0.000573 -0.005046 NaN -0.002988 0.002888 0.003992 -0.003930 -0.008561
discharge_disposition_id 0.005805 -0.019566 0.113970 -0.035383 0.085986 1.000000 0.016614 0.161954 -0.123220 -0.024028 0.022906 0.015536 0.105415 -0.006101 -0.024692 0.019240 0.034616 0.029774 0.024778 0.049496 0.037086 -0.020713 -0.008376 -0.002759 -0.008790 0.018525 -0.022360 0.014597 -0.013379 0.048256 0.003228 -0.014116 -0.001694 0.006779 0.005779 0.008684 0.013139 NaN NaN -0.041842 -0.002994 0.000933 NaN -0.002174 -0.000576 -0.014047 -0.029452 0.009300
admission_source_id 0.033113 -0.005222 0.041070 0.003026 0.098007 0.016614 1.000000 -0.006996 -0.100157 -0.152760 0.046823 -0.137044 -0.055016 0.028833 0.061938 0.033697 -0.007753 -0.019796 0.001447 0.076318 0.412356 0.006512 -0.033283 -0.003732 -0.019612 0.002666 -0.026685 0.001296 0.009300 0.004919 0.001791 -0.005729 -0.008894 -0.000753 -0.000763 0.002245 0.001834 NaN NaN 0.005094 -0.024616 -0.000281 NaN 0.001296 -0.004958 0.002583 0.000535 0.030377
time_in_hospital -0.020364 -0.031088 0.107273 0.023652 -0.014285 0.161954 -0.006996 1.000000 -0.037805 0.023146 0.318234 0.193139 0.468752 -0.003410 -0.005467 0.079929 -0.019913 0.086503 0.068677 0.224265 0.029079 0.058088 -0.009071 0.034985 0.003320 0.004094 0.016086 0.013596 0.016737 0.023482 0.001799 0.008521 0.008531 0.007231 0.005083 0.004746 0.000328 NaN NaN 0.101223 -0.006358 -0.001692 NaN -0.003396 0.002268 0.112359 0.059464 0.057129
payer_code 0.041640 0.000833 0.058032 0.047819 -0.136863 -0.123220 -0.100157 -0.037805 1.000000 -0.082746 -0.049680 -0.047581 0.005658 0.062572 0.067316 0.009598 0.008458 0.036335 0.033135 0.076424 -0.095739 -0.006824 0.027596 0.032986 0.014676 -0.022046 0.038055 -0.004231 0.005875 -0.047599 -0.002662 0.034867 -0.008782 -0.002629 0.011455 -0.007329 -0.015677 NaN NaN 0.115265 0.055730 0.010871 NaN 0.009326 -0.000358 0.121010 0.077597 0.004353
MED_SPEC_NUM -0.030777 0.016623 -0.068202 0.004630 0.185351 -0.024028 -0.152760 0.023146 -0.082746 1.000000 -0.068863 0.076952 0.036943 -0.051445 -0.009879 -0.013909 0.018820 -0.019354 -0.015192 -0.176693 -0.003316 -0.009813 0.023068 0.010220 0.006590 0.002161 0.012798 -0.002891 0.007273 -0.005929 -0.001944 0.002210 0.016639 -0.005808 0.002816 -0.002660 -0.004689 NaN NaN -0.014342 0.000051 -0.006234 NaN -0.002891 0.010115 -0.005111 -0.002299 -0.044800
num_lab_procedures -0.023193 -0.004968 0.025665 0.090456 -0.145869 0.022906 0.046823 0.318234 -0.049680 -0.068863 1.000000 0.055081 0.267707 -0.008437 0.000613 0.037763 -0.071046 0.011204 0.011021 0.149116 -0.124907 0.236383 -0.044042 0.010438 -0.008292 -0.005659 0.005344 0.005344 0.012450 -0.001768 -0.001320 -0.015599 -0.010260 -0.000654 -0.002963 0.005036 0.000008 NaN NaN 0.085401 -0.010852 -0.006685 NaN 0.001689 -0.004330 0.062801 0.030903 0.035997
num_procedures 0.024391 0.061668 -0.028360 0.018693 0.131923 0.015536 -0.137044 0.193139 -0.047581 0.076952 0.055081 1.000000 0.387685 -0.028257 -0.033659 -0.061114 -0.056866 0.036607 0.025920 0.074394 -0.069910 -0.017477 -0.038122 0.005662 -0.002359 0.004757 0.007223 0.006615 0.004999 0.001531 -0.003423 0.016471 0.018742 -0.000362 -0.001521 -0.005745 0.005154 NaN NaN 0.015020 -0.000553 -0.006640 NaN -0.003317 -0.000834 0.005976 -0.009904 -0.037714
num_medications 0.022157 -0.023819 0.039010 0.011274 0.075711 0.105415 -0.055016 0.468752 0.005658 0.036943 0.267707 0.387685 1.000000 0.047313 0.017129 0.066793 0.004288 0.084268 0.063166 0.263311 0.001639 0.013044 0.069433 0.019283 0.023352 -0.000940 0.045223 0.009348 0.056985 0.030886 0.002943 0.071584 0.052860 0.017947 0.006422 0.002992 -0.002113 NaN NaN 0.198963 0.013382 0.002757 NaN -0.002603 0.002074 0.248529 0.186247 0.050711
number_outpatient 0.050845 -0.005846 0.029064 0.104440 0.030746 -0.006101 0.028833 -0.003410 0.062572 -0.051445 -0.008437 -0.028257 0.047313 1.000000 0.087824 0.103471 -0.009347 0.028015 0.026595 0.093518 0.054949 -0.024324 -0.013006 0.001026 0.002719 -0.004402 -0.009039 -0.001242 0.010527 -0.000482 0.000350 0.012212 -0.001550 0.009388 -0.002243 -0.002152 -0.005556 NaN NaN 0.010029 -0.008428 0.003037 NaN -0.001242 -0.001242 0.027105 0.017340 0.068145
number_emergency -0.012812 -0.024202 -0.089149 0.003706 -0.018190 -0.024692 0.061938 -0.005467 0.067316 -0.009879 0.000613 -0.033659 0.017129 0.087824 1.000000 0.279626 -0.023803 -0.004155 0.007427 0.059398 0.035679 -0.004270 -0.009572 0.007820 0.005489 -0.004218 0.003318 -0.000907 -0.003426 -0.027870 -0.002870 -0.001978 -0.006844 0.004224 -0.000207 -0.001572 -0.004059 NaN NaN 0.048501 0.001956 -0.002723 NaN -0.000907 -0.000907 0.041797 0.029415 0.103321
number_inpatient -0.006053 -0.013405 -0.047012 -0.009154 -0.032648 0.019240 0.033697 0.079929 0.009598 -0.013909 0.037763 -0.061114 0.066793 0.103471 0.279626 1.000000 -0.004620 0.024244 0.032150 0.102473 0.038503 -0.049379 -0.073780 0.011936 -0.006284 -0.008317 -0.016545 -0.002118 -0.022736 -0.036659 -0.003545 -0.026804 -0.021471 0.000411 -0.003851 -0.003669 -0.003526 NaN NaN 0.060505 -0.008426 -0.000813 NaN 0.001207 -0.002118 0.025420 0.025559 0.233149
DIAG_CAT_1 0.042924 -0.034311 0.091837 0.023982 0.032151 0.034616 -0.007753 -0.019913 0.008458 0.018820 -0.071046 -0.056866 0.004288 -0.009347 -0.023803 -0.004620 1.000000 0.025858 0.028021 0.046451 -0.016030 -0.091392 0.033199 0.002242 -0.000440 -0.002017 0.000410 0.001038 0.010541 0.017872 0.006039 0.024890 0.010041 0.003061 0.006030 0.000456 0.000745 NaN NaN -0.075260 0.015281 0.004664 NaN -0.003029 0.003943 -0.033688 -0.028985 -0.004994
DIAG_CAT_2 0.029594 0.008083 0.077541 0.031824 -0.005648 0.029774 -0.019796 0.086503 0.036335 -0.019354 0.011204 0.036607 0.084268 0.028015 -0.004155 0.024244 0.025858 1.000000 0.081391 0.171521 -0.017962 -0.044930 -0.018313 0.003082 -0.000322 -0.004128 0.006773 0.002264 0.004223 0.010435 0.000339 0.000030 -0.010618 0.000704 0.005705 0.001300 -0.003710 NaN NaN -0.007776 -0.007621 0.005659 NaN -0.000006 -0.005116 -0.006439 -0.010210 0.011850
DIAG_CAT_3 0.016000 0.008343 0.052021 0.014000 -0.008918 0.024778 0.001447 0.068677 0.033135 -0.015192 0.011021 0.025920 0.063166 0.026595 0.007427 0.032150 0.028021 0.081391 1.000000 0.186667 -0.009693 -0.031716 -0.024179 0.005636 0.003922 -0.007445 -0.010677 0.000333 -0.005554 -0.005157 -0.002879 -0.008180 -0.003303 0.000458 -0.000319 -0.000620 -0.003145 NaN NaN 0.013942 -0.000101 0.006007 NaN 0.006032 -0.004330 0.005824 -0.007452 0.027877
number_diagnoses 0.081672 -0.007818 0.243515 0.054391 -0.113991 0.049496 0.076318 0.224265 0.076424 -0.176693 0.149116 0.074394 0.263311 0.093518 0.059398 0.102473 0.046451 0.171521 0.186667 1.000000 -0.036161 -0.032983 -0.073736 0.033225 0.012336 -0.014080 0.013640 0.003449 -0.005975 -0.024247 0.001220 0.002278 -0.011524 0.007741 -0.000293 0.004710 -0.013444 NaN NaN 0.076730 -0.005894 -0.006428 NaN 0.003449 -0.007491 0.055250 0.019375 0.103885
max_glu_serum 0.054576 -0.001347 0.018618 -0.037139 0.352793 0.037086 0.412356 0.029079 -0.095739 -0.003316 -0.124907 -0.069910 0.001639 0.054949 0.035679 0.038503 -0.016030 -0.017962 -0.009693 -0.036161 1.000000 -0.043540 -0.029790 -0.015106 -0.016794 0.008938 -0.031840 -0.000902 0.005931 0.000373 0.006437 -0.014531 -0.009275 0.005479 -0.004328 -0.001562 -0.004032 NaN NaN 0.000884 -0.014296 -0.002705 NaN -0.000902 -0.000902 0.008958 -0.005206 0.017684
A1Cresult -0.013318 0.016539 -0.147559 -0.021109 -0.043929 -0.020713 0.006512 0.058088 -0.006824 -0.009813 0.236383 -0.017477 0.013044 -0.024324 -0.004270 -0.049379 -0.091392 -0.044930 -0.031716 -0.032983 -0.043540 1.000000 0.051894 0.022541 -0.000669 -0.003225 0.022787 -0.001747 0.020844 0.009977 -0.005526 0.000223 0.009548 0.009374 0.007741 -0.003026 -0.000390 NaN NaN 0.107227 -0.005008 0.001082 NaN -0.001747 -0.001747 0.105614 0.086291 -0.013614
metformin 0.010548 0.001549 -0.060696 0.007304 0.008631 -0.008376 -0.033283 -0.009071 0.027596 0.023068 -0.044042 -0.038122 0.069433 -0.013006 -0.009572 -0.073780 0.033199 -0.018313 -0.024179 -0.073736 -0.029790 0.051894 1.000000 -0.001074 0.020372 -0.011841 0.047475 -0.002068 0.077111 0.129061 -0.006539 0.060566 0.097708 0.006246 0.005628 -0.003582 0.004664 NaN NaN -0.017392 -0.021191 -0.002748 NaN 0.008300 0.003116 0.325302 0.267566 -0.035809
repaglinide 0.025466 -0.004777 0.045565 -0.005440 -0.003481 -0.002759 -0.003732 0.034985 0.032986 0.010220 0.010438 0.005662 0.019283 0.001026 0.007820 0.011936 0.002242 0.003082 0.005636 0.033225 -0.015106 0.022541 -0.001074 1.000000 -0.003246 -0.003466 -0.007518 -0.000511 -0.015927 -0.024160 -0.001617 0.019393 0.009031 0.011257 0.018066 -0.000886 -0.002287 NaN NaN 0.006058 -0.004506 -0.001534 NaN -0.000511 -0.000511 0.071294 0.066174 0.014286
nateglinide -0.004170 -0.005390 0.020363 0.010707 -0.008099 -0.008790 -0.019612 0.003320 0.014676 0.006590 -0.008292 -0.002359 0.023352 0.002719 0.005489 -0.006284 -0.000440 -0.000322 0.003922 0.012336 -0.016794 -0.000669 0.020372 -0.003246 1.000000 -0.002386 0.004488 -0.000352 -0.018191 -0.020817 -0.001113 0.025830 0.013947 -0.004585 0.018302 -0.000610 -0.001575 NaN NaN 0.001396 -0.006775 -0.001056 NaN -0.000352 -0.000352 0.052927 0.045552 0.007164
chlorpropamide 0.006801 0.006481 0.012367 -0.000839 0.007875 0.018525 0.002666 0.004094 -0.022046 0.002161 -0.005659 0.004757 -0.000940 -0.004402 -0.004218 -0.008317 -0.002017 -0.004128 -0.007445 -0.014080 0.008938 -0.003225 -0.011841 -0.003466 -0.002386 1.000000 -0.006537 -0.000121 -0.010745 -0.005823 -0.000383 -0.007959 -0.000123 -0.001576 -0.000581 -0.000210 -0.000541 NaN NaN -0.020008 -0.002329 -0.000363 NaN -0.000121 -0.000121 -0.007035 0.015661 -0.002806
glimepiride 0.008261 -0.000156 0.044360 0.013694 -0.003178 -0.022360 -0.026685 0.016086 0.038055 0.012798 0.005344 0.007223 0.045223 -0.009039 0.003318 -0.016545 0.000410 0.006773 -0.010677 0.013640 -0.031840 0.022787 0.047475 -0.007518 0.004488 -0.006537 1.000000 -0.000964 -0.071983 -0.067334 -0.003050 0.042601 0.038655 0.018418 0.019830 0.009191 -0.004314 NaN NaN 0.012479 -0.012202 -0.002894 NaN -0.000964 -0.000964 0.138970 0.124797 0.004760
acetohexamide 0.001793 -0.003935 0.002400 -0.000736 -0.002988 0.014597 0.001296 0.013596 -0.004231 -0.002891 0.005344 0.006615 0.009348 -0.001242 -0.000907 -0.002118 0.001038 0.002264 0.000333 0.003449 -0.000902 -0.001747 -0.002068 -0.000511 -0.000352 -0.000121 -0.000964 1.000000 -0.001585 -0.001414 -0.000056 -0.001174 -0.001085 -0.000233 -0.000086 -0.000031 -0.000080 NaN NaN 0.003607 -0.000344 -0.000054 NaN -0.000018 -0.000018 0.004554 0.002311 0.002639
glipizide 0.018551 0.026810 0.055867 0.017062 0.007991 -0.013379 0.009300 0.016737 0.005875 0.007273 0.012450 0.004999 0.056985 0.010527 -0.003426 -0.022736 0.010541 0.004223 -0.005554 -0.005975 0.005931 0.020844 0.077111 -0.015927 -0.018191 -0.010745 -0.071983 -0.001585 1.000000 -0.104495 -0.005014 0.049752 0.041498 0.030598 0.002971 -0.002746 -0.001524 NaN NaN -0.027179 -0.027923 -0.000607 NaN -0.001585 -0.001585 0.194260 0.205145 0.014766
glyburide 0.015784 0.034631 0.076798 0.008707 -0.002804 0.048256 0.004919 0.023482 -0.047599 -0.005929 -0.001768 0.001531 0.030886 -0.000482 -0.027870 -0.036659 0.017872 0.010435 -0.005157 -0.024247 0.000373 0.009977 0.129061 -0.024160 -0.020817 -0.005823 -0.067334 -0.001414 -0.104495 1.000000 -0.004473 0.027727 0.030766 0.015094 -0.000056 -0.002450 -0.006327 NaN NaN -0.071853 -0.006909 0.000245 NaN -0.001414 -0.001414 0.172392 0.183024 -0.004492
tolbutamide -0.001455 -0.001727 0.010110 -0.002326 0.006347 0.003228 0.001791 0.001799 -0.002662 -0.001944 -0.001320 -0.003423 0.002943 0.000350 -0.002870 -0.003545 0.006039 0.000339 -0.002879 0.001220 0.006437 -0.005526 -0.006539 -0.001617 -0.001113 -0.000383 -0.003050 -0.000056 -0.005014 -0.004473 1.000000 -0.003714 -0.003432 -0.000736 -0.000271 -0.000098 -0.000253 NaN NaN -0.001925 -0.001087 -0.000169 NaN -0.000056 -0.000056 0.001000 0.007308 -0.007263
pioglitazone 0.026105 0.002339 0.013860 0.026059 0.018570 -0.014116 -0.005729 0.008521 0.034867 0.002210 -0.015599 0.016471 0.071584 0.012212 -0.001978 -0.026804 0.024890 0.000030 -0.008180 0.002278 -0.014531 0.000223 0.060566 0.019393 0.025830 -0.007959 0.042601 -0.001174 0.049752 0.027727 -0.003714 1.000000 -0.062763 0.015377 0.000791 -0.002034 -0.001659 NaN NaN 0.003954 0.022117 0.007190 NaN -0.001174 0.014894 0.203180 0.151949 0.011002
rosiglitazone 0.005938 0.010843 0.003034 0.004232 0.022930 -0.001694 -0.008894 0.008531 -0.008782 0.016639 -0.010260 0.018742 0.052860 -0.001550 -0.006844 -0.021471 0.010041 -0.010618 -0.003303 -0.011524 -0.009275 0.009548 0.097708 0.009031 0.013947 -0.000123 0.038655 -0.001085 0.041498 0.030766 -0.003432 -0.062763 1.000000 0.002006 0.003416 0.008079 -0.000996 NaN NaN 0.004080 0.003340 -0.003256 NaN -0.001085 -0.001085 0.191641 0.140410 0.005522
acarbose 0.013237 0.010581 0.008092 0.010411 0.006061 0.006779 -0.000753 0.007231 -0.002629 -0.005808 -0.000654 -0.000362 0.017947 0.009388 0.004224 0.000411 0.003061 0.000704 0.000458 0.007741 0.005479 0.009374 0.006246 0.011257 -0.004585 -0.001576 0.018418 -0.000233 0.030598 0.015094 -0.000736 0.015377 0.002006 1.000000 -0.001117 -0.000403 -0.001040 NaN NaN -0.001790 0.013046 -0.000698 NaN -0.000233 -0.000233 0.047261 0.030097 0.007816
miglitol -0.001307 0.009920 0.011788 -0.003532 -0.001414 0.005779 -0.000763 0.005083 0.011455 0.002816 -0.002963 -0.001521 0.006422 -0.002243 -0.000207 -0.003851 0.006030 0.005705 -0.000319 -0.000293 -0.004328 0.007741 0.005628 0.018066 0.018302 -0.000581 0.019830 -0.000086 0.002971 -0.000056 -0.000271 0.000791 0.003416 -0.001117 1.000000 -0.000148 -0.000383 NaN NaN 0.000451 -0.001650 -0.000257 NaN -0.000086 -0.000086 0.018472 0.011094 0.003413
troglitazone 0.003106 0.007860 -0.001978 -0.001274 0.003307 0.008684 0.002245 0.004746 -0.007329 -0.002660 0.005036 -0.005745 0.002992 -0.002152 -0.001572 -0.003669 0.000456 0.001300 -0.000620 0.004710 -0.001562 -0.003026 -0.003582 -0.000886 -0.000610 -0.000210 0.009191 -0.000031 -0.002746 -0.002450 -0.000098 -0.002034 0.008079 -0.000403 -0.000148 1.000000 -0.000138 NaN NaN -0.000391 -0.000595 -0.000093 NaN -0.000031 -0.000031 0.007888 0.004002 0.001009
tolazamide 0.003990 0.003242 0.003605 0.000692 0.010291 0.013139 0.001834 0.000328 -0.015677 -0.004689 0.000008 0.005154 -0.002113 -0.005556 -0.004059 -0.003526 0.000745 -0.003710 -0.003145 -0.013444 -0.004032 -0.000390 0.004664 -0.002287 -0.001575 -0.000541 -0.004314 -0.000080 -0.001524 -0.006327 -0.000253 -0.001659 -0.000996 -0.001040 -0.000383 -0.000138 1.000000 NaN NaN -0.013867 -0.001537 -0.000240 NaN -0.000080 -0.000080 -0.002376 0.010336 -0.007513
examide NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
citoglipton NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
insulin -0.039862 0.000247 -0.079078 -0.076697 -0.025368 -0.041842 0.005094 0.101223 0.115265 -0.014342 0.085401 0.015020 0.198963 0.010029 0.048501 0.060505 -0.075260 -0.007776 0.013942 0.076730 0.000884 0.107227 -0.017392 0.006058 0.001396 -0.020008 0.012479 0.003607 -0.027179 -0.071853 -0.001925 0.003954 0.004080 -0.001790 0.000451 -0.000391 -0.013867 NaN NaN 1.000000 0.005828 -0.000677 NaN 0.003607 0.003607 0.461502 0.525169 0.040750
glyburide.metformin 0.006384 0.002489 -0.002451 -0.014159 -0.000573 -0.002994 -0.024616 -0.006358 0.055730 0.000051 -0.010852 -0.000553 0.013382 -0.008428 0.001956 -0.008426 0.015281 -0.007621 -0.000101 -0.005894 -0.014296 -0.005008 -0.021191 -0.004506 -0.006775 -0.002329 -0.012202 -0.000344 -0.027923 -0.006909 -0.001087 0.022117 0.003340 0.013046 -0.001650 -0.000595 -0.001537 NaN NaN 0.005828 1.000000 0.050992 NaN -0.000344 -0.000344 0.038712 0.044474 -0.001842
glipizide.metformin 0.005380 0.007965 0.003658 -0.002207 -0.005046 0.000933 -0.000281 -0.001692 0.010871 -0.006234 -0.006685 -0.006640 0.002757 0.003037 -0.002723 -0.000813 0.004664 0.005659 0.006007 -0.006428 -0.002705 0.001082 -0.002748 -0.001534 -0.001056 -0.000363 -0.002894 -0.000054 -0.000607 0.000245 -0.000169 0.007190 -0.003256 -0.000698 -0.000257 -0.000093 -0.000240 NaN NaN -0.000677 0.050992 1.000000 NaN -0.000054 -0.000054 0.010838 0.006933 0.001747
glimepiride.pioglitazone NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
metformin.rosiglitazone -0.011726 0.004538 0.002400 -0.000736 -0.002988 -0.002174 0.001296 -0.003396 0.009326 -0.002891 0.001689 -0.003317 -0.002603 -0.001242 -0.000907 0.001207 -0.003029 -0.000006 0.006032 0.003449 -0.000902 -0.001747 0.008300 -0.000511 -0.000352 -0.000121 -0.000964 -0.000018 -0.001585 -0.001414 -0.000056 -0.001174 -0.001085 -0.000233 -0.000086 -0.000031 -0.000080 NaN NaN 0.003607 -0.000344 -0.000054 NaN 1.000000 -0.000018 0.004554 0.002311 -0.003530
metformin.pioglitazone 0.001793 -0.003935 -0.000257 -0.000736 0.002888 -0.000576 -0.004958 0.002268 -0.000358 0.010115 -0.004330 -0.000834 0.002074 -0.001242 -0.000907 -0.002118 0.003943 -0.005116 -0.004330 -0.007491 -0.000902 -0.001747 0.003116 -0.000511 -0.000352 -0.000121 -0.000964 -0.000018 -0.001585 -0.001414 -0.000056 0.014894 -0.001085 -0.000233 -0.000086 -0.000031 -0.000080 NaN NaN 0.003607 -0.000344 -0.000054 NaN -0.000018 1.000000 0.004554 0.002311 -0.003530
change 0.008300 0.012476 -0.037793 -0.041219 0.003992 -0.014047 0.002583 0.112359 0.121010 -0.005111 0.062801 0.005976 0.248529 0.027105 0.041797 0.025420 -0.033688 -0.006439 0.005824 0.055250 0.008958 0.105614 0.325302 0.071294 0.052927 -0.007035 0.138970 0.004554 0.194260 0.172392 0.001000 0.203180 0.191641 0.047261 0.018472 0.007888 -0.002376 NaN NaN 0.461502 0.038712 0.010838 NaN 0.004554 0.004554 1.000000 0.507411 0.046717
diabetesMed -0.004537 0.015391 -0.025360 -0.030585 -0.003930 -0.029452 0.000535 0.059464 0.077597 -0.002299 0.030903 -0.009904 0.186247 0.017340 0.029415 0.025559 -0.028985 -0.010210 -0.007452 0.019375 -0.005206 0.086291 0.267566 0.066174 0.045552 0.015661 0.124797 0.002311 0.205145 0.183024 0.007308 0.151949 0.140410 0.030097 0.011094 0.004002 0.010336 NaN NaN 0.525169 0.044474 0.006933 NaN 0.002311 0.002311 0.507411 1.000000 0.058183
readmitted 0.014912 -0.013626 0.029704 0.027236 -0.008561 0.009300 0.030377 0.057129 0.004353 -0.044800 0.035997 -0.037714 0.050711 0.068145 0.103321 0.233149 -0.004994 0.011850 0.027877 0.103885 0.017684 -0.013614 -0.035809 0.014286 0.007164 -0.002806 0.004760 0.002639 0.014766 -0.004492 -0.007263 0.011002 0.005522 0.007816 0.003413 0.001009 -0.007513 NaN NaN 0.040750 -0.001842 0.001747 NaN -0.003530 -0.003530 0.046717 0.058183 1.000000

Correlation Analysis Notes:

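  • Lots of noise; the feature set needs to be narrowed

Strongest positive correlations with 'readmitted':

  • number_emergency (ER visits): 0.103
  • number_inpatient: 0.233
  • number_diagnoses: 0.104

Strongest negative correlations with 'readmitted':

  • MED_SPEC_NUM: -0.045, which is not meaningful at this point because the codes are simply sorted alphabetically
  • num_procedures: -0.038
  • metformin: -0.036
  • No correlation with several medications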
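
Rather than reading the strongest correlations off the full matrix by eye, the 'readmitted' column can be pulled out and sorted. The sketch below is one way to do that on a toy frame; the helper name rank_target_correlations and the toy values are assumptions for illustration, not part of the original analysis.

```python
import pandas as pd

def rank_target_correlations(frame, target="readmitted"):
    """Return each feature's Pearson correlation with the target,
    ordered from largest to smallest absolute value."""
    # Drop the target's self-correlation and any NaN rows
    # (constant columns produce NaN, as examide does above).
    corr = frame.corr()[target].drop(target).dropna()
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Toy stand-in for the notebook's df (values are illustrative only).
toy = pd.DataFrame({
    "number_inpatient": [0, 1, 2, 3, 4, 5],
    "num_procedures": [5, 3, 4, 1, 2, 0],
    "examide": [0, 0, 0, 0, 0, 0],
    "readmitted": [0, 0, 1, 1, 2, 2],
})

ranked = rank_target_correlations(toy)
print(ranked)
```

Sorting by absolute value keeps strong negative correlations near the top next to strong positive ones, while the sign is preserved in the returned values.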
In [301]:
# drop these medication columns: none of them is used in any case (all values are 0)
df = df.drop(['examide', 'citoglipton', 'glimepiride.pioglitazone'], axis=1)

df.head(5)
Out[301]:
race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code MED_SPEC_NUM num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient DIAG_CAT_1 DIAG_CAT_2 DIAG_CAT_3 number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide insulin glyburide.metformin glipizide.metformin metformin.rosiglitazone metformin.pioglitazone change diabetesMed readmitted
0 3 0 8 0 3 1 4 5 0 0 39 3 11 0 0 0 10 4 18 7 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1
1 3 0 7 0 5 3 1 6 7 0 79 1 25 3 0 0 16 13 16 9 0 0 2 0 0 0 0 0 0 2 0 0 0 0 0 0 0 3 0 0 0 0 1 1 0
2 5 0 6 0 1 22 7 4 14 18 29 2 18 0 0 1 24 18 2 9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 1 0
3 3 1 7 0 1 1 7 3 0 18 72 3 18 0 0 0 17 4 3 9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 1 1
4 3 0 3 0 2 1 1 3 0 0 21 1 6 0 0 0 23 18 32 9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
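
The three dropped columns are the ones with a standard deviation of 0 in describe() and NaN throughout corr(). A minimal sketch of finding such constant columns programmatically, instead of naming them by hand, is shown below. The helper constant_columns and the toy frame are hypothetical, and drop(columns=...) assumes a reasonably recent pandas version.

```python
import pandas as pd

def constant_columns(frame):
    """Return the names of columns that hold a single distinct value."""
    return [col for col in frame.columns
            if frame[col].nunique(dropna=False) <= 1]

# Toy stand-in for the notebook's df (values are illustrative only).
toy = pd.DataFrame({
    "insulin": [0, 2, 3, 1],
    "examide": [0, 0, 0, 0],
    "citoglipton": [0, 0, 0, 0],
})

to_drop = constant_columns(toy)
cleaned = toy.drop(columns=to_drop)

print(to_drop)
print(cleaned.columns.tolist())
```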
In [302]:
#basic statistics
df.describe()
Out[302]:
race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code MED_SPEC_NUM num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient DIAG_CAT_1 DIAG_CAT_2 DIAG_CAT_3 number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide insulin glyburide.metformin glipizide.metformin metformin.rosiglitazone metformin.pioglitazone change diabetesMed readmitted
count 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000
mean 2.602071 0.464464 6.096589 0.123946 2.016893 3.721821 5.756643 4.398161 4.369375 10.668643 43.141661 1.335893 16.009268 0.367321 0.196875 0.637054 14.213321 12.011054 11.357411 7.423750 0.092089 0.368375 0.398875 0.029911 0.014089 0.001732 0.102536 0.000036 0.254732 0.210071 0.000357 0.146161 0.125821 0.006214 0.000857 0.000107 0.000714 1.058839 0.013214 0.000321 0.000036 0.000036 0.462679 0.769821 0.572268
std 0.937754 0.498740 1.590761 0.712004 1.438340 5.291517 4.053838 2.984346 4.363828 15.595799 19.656507 1.702009 8.132455 1.249570 0.916820 1.270768 7.272908 7.443902 8.157131 1.931488 0.431655 0.890972 0.815169 0.247161 0.169132 0.060480 0.449274 0.008452 0.678992 0.627625 0.026724 0.525985 0.490002 0.112904 0.042249 0.014638 0.037790 1.102484 0.162472 0.025353 0.008452 0.008452 0.498610 0.420951 0.685018
min 0.000000 0.000000 0.000000 0.000000 1.000000 1.000000 1.000000 1.000000 0.000000 0.000000 1.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 3.000000 0.000000 5.000000 0.000000 1.000000 1.000000 1.000000 2.000000 0.000000 0.000000 32.000000 0.000000 10.000000 0.000000 0.000000 0.000000 10.000000 4.000000 3.000000 6.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000
50% 3.000000 0.000000 6.000000 0.000000 1.000000 1.000000 7.000000 4.000000 6.000000 4.000000 44.000000 1.000000 15.000000 0.000000 0.000000 0.000000 15.000000 12.000000 10.000000 8.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000
75% 3.000000 1.000000 7.000000 0.000000 3.000000 4.000000 7.000000 6.000000 7.000000 18.000000 57.000000 2.000000 20.000000 0.000000 0.000000 1.000000 18.000000 17.000000 17.000000 9.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 2.000000 0.000000 0.000000 0.000000 0.000000 1.000000 1.000000 1.000000
max 5.000000 1.000000 9.000000 9.000000 8.000000 28.000000 25.000000 14.000000 16.000000 63.000000 132.000000 6.000000 75.000000 42.000000 76.000000 21.000000 32.000000 32.000000 32.000000 16.000000 3.000000 3.000000 3.000000 3.000000 3.000000 3.000000 3.000000 2.000000 3.000000 3.000000 2.000000 3.000000 3.000000 3.000000 3.000000 2.000000 2.000000 3.000000 3.000000 2.000000 2.000000 2.000000 1.000000 1.000000 2.000000
In [303]:
# correlation analysis
df.corr()
Out[303]:
race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code MED_SPEC_NUM num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient DIAG_CAT_1 DIAG_CAT_2 DIAG_CAT_3 number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide insulin glyburide.metformin glipizide.metformin metformin.rosiglitazone metformin.pioglitazone change diabetesMed readmitted
race 1.000000 0.061706 0.114255 0.040520 0.096587 0.005805 0.033113 -0.020364 0.041640 -0.030777 -0.023193 0.024391 0.022157 0.050845 -0.012812 -0.006053 0.042924 0.029594 0.016000 0.081672 0.054576 -0.013318 0.010548 0.025466 -0.004170 0.006801 0.008261 0.001793 0.018551 0.015784 -0.001455 0.026105 0.005938 0.013237 -0.001307 0.003106 0.003990 -0.039862 0.006384 0.005380 -0.011726 0.001793 0.008300 -0.004537 0.014912
gender 0.061706 1.000000 -0.048579 0.014491 0.014578 -0.019566 -0.005222 -0.031088 0.000833 0.016623 -0.004968 0.061668 -0.023819 -0.005846 -0.024202 -0.013405 -0.034311 0.008083 0.008343 -0.007818 -0.001347 0.016539 0.001549 -0.004777 -0.005390 0.006481 -0.000156 -0.003935 0.026810 0.034631 -0.001727 0.002339 0.010843 0.010581 0.009920 0.007860 0.003242 0.000247 0.002489 0.007965 0.004538 -0.003935 0.012476 0.015391 -0.013626
age 0.114255 -0.048579 1.000000 0.005716 -0.005747 0.113970 0.041070 0.107273 0.058032 -0.068202 0.025665 -0.028360 0.039010 0.029064 -0.089149 -0.047012 0.091837 0.077541 0.052021 0.243515 0.018618 -0.147559 -0.060696 0.045565 0.020363 0.012367 0.044360 0.002400 0.055867 0.076798 0.010110 0.013860 0.003034 0.008092 0.011788 -0.001978 0.003605 -0.079078 -0.002451 0.003658 0.002400 -0.000257 -0.037793 -0.025360 0.029704
weight 0.040520 0.014491 0.005716 1.000000 0.037503 -0.035383 0.003026 0.023652 0.047819 0.004630 0.090456 0.018693 0.011274 0.104440 0.003706 -0.009154 0.023982 0.031824 0.014000 0.054391 -0.037139 -0.021109 0.007304 -0.005440 0.010707 -0.000839 0.013694 -0.000736 0.017062 0.008707 -0.002326 0.026059 0.004232 0.010411 -0.003532 -0.001274 0.000692 -0.076697 -0.014159 -0.002207 -0.000736 -0.000736 -0.041219 -0.030585 0.027236
admission_type_id 0.096587 0.014578 -0.005747 0.037503 1.000000 0.085986 0.098007 -0.014285 -0.136863 0.185351 -0.145869 0.131923 0.075711 0.030746 -0.018190 -0.032648 0.032151 -0.005648 -0.008918 -0.113991 0.352793 -0.043929 0.008631 -0.003481 -0.008099 0.007875 -0.003178 -0.002988 0.007991 -0.002804 0.006347 0.018570 0.022930 0.006061 -0.001414 0.003307 0.010291 -0.025368 -0.000573 -0.005046 -0.002988 0.002888 0.003992 -0.003930 -0.008561
discharge_disposition_id 0.005805 -0.019566 0.113970 -0.035383 0.085986 1.000000 0.016614 0.161954 -0.123220 -0.024028 0.022906 0.015536 0.105415 -0.006101 -0.024692 0.019240 0.034616 0.029774 0.024778 0.049496 0.037086 -0.020713 -0.008376 -0.002759 -0.008790 0.018525 -0.022360 0.014597 -0.013379 0.048256 0.003228 -0.014116 -0.001694 0.006779 0.005779 0.008684 0.013139 -0.041842 -0.002994 0.000933 -0.002174 -0.000576 -0.014047 -0.029452 0.009300
admission_source_id 0.033113 -0.005222 0.041070 0.003026 0.098007 0.016614 1.000000 -0.006996 -0.100157 -0.152760 0.046823 -0.137044 -0.055016 0.028833 0.061938 0.033697 -0.007753 -0.019796 0.001447 0.076318 0.412356 0.006512 -0.033283 -0.003732 -0.019612 0.002666 -0.026685 0.001296 0.009300 0.004919 0.001791 -0.005729 -0.008894 -0.000753 -0.000763 0.002245 0.001834 0.005094 -0.024616 -0.000281 0.001296 -0.004958 0.002583 0.000535 0.030377
time_in_hospital -0.020364 -0.031088 0.107273 0.023652 -0.014285 0.161954 -0.006996 1.000000 -0.037805 0.023146 0.318234 0.193139 0.468752 -0.003410 -0.005467 0.079929 -0.019913 0.086503 0.068677 0.224265 0.029079 0.058088 -0.009071 0.034985 0.003320 0.004094 0.016086 0.013596 0.016737 0.023482 0.001799 0.008521 0.008531 0.007231 0.005083 0.004746 0.000328 0.101223 -0.006358 -0.001692 -0.003396 0.002268 0.112359 0.059464 0.057129
payer_code 0.041640 0.000833 0.058032 0.047819 -0.136863 -0.123220 -0.100157 -0.037805 1.000000 -0.082746 -0.049680 -0.047581 0.005658 0.062572 0.067316 0.009598 0.008458 0.036335 0.033135 0.076424 -0.095739 -0.006824 0.027596 0.032986 0.014676 -0.022046 0.038055 -0.004231 0.005875 -0.047599 -0.002662 0.034867 -0.008782 -0.002629 0.011455 -0.007329 -0.015677 0.115265 0.055730 0.010871 0.009326 -0.000358 0.121010 0.077597 0.004353
MED_SPEC_NUM -0.030777 0.016623 -0.068202 0.004630 0.185351 -0.024028 -0.152760 0.023146 -0.082746 1.000000 -0.068863 0.076952 0.036943 -0.051445 -0.009879 -0.013909 0.018820 -0.019354 -0.015192 -0.176693 -0.003316 -0.009813 0.023068 0.010220 0.006590 0.002161 0.012798 -0.002891 0.007273 -0.005929 -0.001944 0.002210 0.016639 -0.005808 0.002816 -0.002660 -0.004689 -0.014342 0.000051 -0.006234 -0.002891 0.010115 -0.005111 -0.002299 -0.044800
num_lab_procedures -0.023193 -0.004968 0.025665 0.090456 -0.145869 0.022906 0.046823 0.318234 -0.049680 -0.068863 1.000000 0.055081 0.267707 -0.008437 0.000613 0.037763 -0.071046 0.011204 0.011021 0.149116 -0.124907 0.236383 -0.044042 0.010438 -0.008292 -0.005659 0.005344 0.005344 0.012450 -0.001768 -0.001320 -0.015599 -0.010260 -0.000654 -0.002963 0.005036 0.000008 0.085401 -0.010852 -0.006685 0.001689 -0.004330 0.062801 0.030903 0.035997
num_procedures 0.024391 0.061668 -0.028360 0.018693 0.131923 0.015536 -0.137044 0.193139 -0.047581 0.076952 0.055081 1.000000 0.387685 -0.028257 -0.033659 -0.061114 -0.056866 0.036607 0.025920 0.074394 -0.069910 -0.017477 -0.038122 0.005662 -0.002359 0.004757 0.007223 0.006615 0.004999 0.001531 -0.003423 0.016471 0.018742 -0.000362 -0.001521 -0.005745 0.005154 0.015020 -0.000553 -0.006640 -0.003317 -0.000834 0.005976 -0.009904 -0.037714
num_medications 0.022157 -0.023819 0.039010 0.011274 0.075711 0.105415 -0.055016 0.468752 0.005658 0.036943 0.267707 0.387685 1.000000 0.047313 0.017129 0.066793 0.004288 0.084268 0.063166 0.263311 0.001639 0.013044 0.069433 0.019283 0.023352 -0.000940 0.045223 0.009348 0.056985 0.030886 0.002943 0.071584 0.052860 0.017947 0.006422 0.002992 -0.002113 0.198963 0.013382 0.002757 -0.002603 0.002074 0.248529 0.186247 0.050711
number_outpatient 0.050845 -0.005846 0.029064 0.104440 0.030746 -0.006101 0.028833 -0.003410 0.062572 -0.051445 -0.008437 -0.028257 0.047313 1.000000 0.087824 0.103471 -0.009347 0.028015 0.026595 0.093518 0.054949 -0.024324 -0.013006 0.001026 0.002719 -0.004402 -0.009039 -0.001242 0.010527 -0.000482 0.000350 0.012212 -0.001550 0.009388 -0.002243 -0.002152 -0.005556 0.010029 -0.008428 0.003037 -0.001242 -0.001242 0.027105 0.017340 0.068145
number_emergency -0.012812 -0.024202 -0.089149 0.003706 -0.018190 -0.024692 0.061938 -0.005467 0.067316 -0.009879 0.000613 -0.033659 0.017129 0.087824 1.000000 0.279626 -0.023803 -0.004155 0.007427 0.059398 0.035679 -0.004270 -0.009572 0.007820 0.005489 -0.004218 0.003318 -0.000907 -0.003426 -0.027870 -0.002870 -0.001978 -0.006844 0.004224 -0.000207 -0.001572 -0.004059 0.048501 0.001956 -0.002723 -0.000907 -0.000907 0.041797 0.029415 0.103321
number_inpatient -0.006053 -0.013405 -0.047012 -0.009154 -0.032648 0.019240 0.033697 0.079929 0.009598 -0.013909 0.037763 -0.061114 0.066793 0.103471 0.279626 1.000000 -0.004620 0.024244 0.032150 0.102473 0.038503 -0.049379 -0.073780 0.011936 -0.006284 -0.008317 -0.016545 -0.002118 -0.022736 -0.036659 -0.003545 -0.026804 -0.021471 0.000411 -0.003851 -0.003669 -0.003526 0.060505 -0.008426 -0.000813 0.001207 -0.002118 0.025420 0.025559 0.233149
DIAG_CAT_1 0.042924 -0.034311 0.091837 0.023982 0.032151 0.034616 -0.007753 -0.019913 0.008458 0.018820 -0.071046 -0.056866 0.004288 -0.009347 -0.023803 -0.004620 1.000000 0.025858 0.028021 0.046451 -0.016030 -0.091392 0.033199 0.002242 -0.000440 -0.002017 0.000410 0.001038 0.010541 0.017872 0.006039 0.024890 0.010041 0.003061 0.006030 0.000456 0.000745 -0.075260 0.015281 0.004664 -0.003029 0.003943 -0.033688 -0.028985 -0.004994
DIAG_CAT_2 0.029594 0.008083 0.077541 0.031824 -0.005648 0.029774 -0.019796 0.086503 0.036335 -0.019354 0.011204 0.036607 0.084268 0.028015 -0.004155 0.024244 0.025858 1.000000 0.081391 0.171521 -0.017962 -0.044930 -0.018313 0.003082 -0.000322 -0.004128 0.006773 0.002264 0.004223 0.010435 0.000339 0.000030 -0.010618 0.000704 0.005705 0.001300 -0.003710 -0.007776 -0.007621 0.005659 -0.000006 -0.005116 -0.006439 -0.010210 0.011850
DIAG_CAT_3 0.016000 0.008343 0.052021 0.014000 -0.008918 0.024778 0.001447 0.068677 0.033135 -0.015192 0.011021 0.025920 0.063166 0.026595 0.007427 0.032150 0.028021 0.081391 1.000000 0.186667 -0.009693 -0.031716 -0.024179 0.005636 0.003922 -0.007445 -0.010677 0.000333 -0.005554 -0.005157 -0.002879 -0.008180 -0.003303 0.000458 -0.000319 -0.000620 -0.003145 0.013942 -0.000101 0.006007 0.006032 -0.004330 0.005824 -0.007452 0.027877
number_diagnoses 0.081672 -0.007818 0.243515 0.054391 -0.113991 0.049496 0.076318 0.224265 0.076424 -0.176693 0.149116 0.074394 0.263311 0.093518 0.059398 0.102473 0.046451 0.171521 0.186667 1.000000 -0.036161 -0.032983 -0.073736 0.033225 0.012336 -0.014080 0.013640 0.003449 -0.005975 -0.024247 0.001220 0.002278 -0.011524 0.007741 -0.000293 0.004710 -0.013444 0.076730 -0.005894 -0.006428 0.003449 -0.007491 0.055250 0.019375 0.103885
max_glu_serum 0.054576 -0.001347 0.018618 -0.037139 0.352793 0.037086 0.412356 0.029079 -0.095739 -0.003316 -0.124907 -0.069910 0.001639 0.054949 0.035679 0.038503 -0.016030 -0.017962 -0.009693 -0.036161 1.000000 -0.043540 -0.029790 -0.015106 -0.016794 0.008938 -0.031840 -0.000902 0.005931 0.000373 0.006437 -0.014531 -0.009275 0.005479 -0.004328 -0.001562 -0.004032 0.000884 -0.014296 -0.002705 -0.000902 -0.000902 0.008958 -0.005206 0.017684
A1Cresult -0.013318 0.016539 -0.147559 -0.021109 -0.043929 -0.020713 0.006512 0.058088 -0.006824 -0.009813 0.236383 -0.017477 0.013044 -0.024324 -0.004270 -0.049379 -0.091392 -0.044930 -0.031716 -0.032983 -0.043540 1.000000 0.051894 0.022541 -0.000669 -0.003225 0.022787 -0.001747 0.020844 0.009977 -0.005526 0.000223 0.009548 0.009374 0.007741 -0.003026 -0.000390 0.107227 -0.005008 0.001082 -0.001747 -0.001747 0.105614 0.086291 -0.013614
metformin 0.010548 0.001549 -0.060696 0.007304 0.008631 -0.008376 -0.033283 -0.009071 0.027596 0.023068 -0.044042 -0.038122 0.069433 -0.013006 -0.009572 -0.073780 0.033199 -0.018313 -0.024179 -0.073736 -0.029790 0.051894 1.000000 -0.001074 0.020372 -0.011841 0.047475 -0.002068 0.077111 0.129061 -0.006539 0.060566 0.097708 0.006246 0.005628 -0.003582 0.004664 -0.017392 -0.021191 -0.002748 0.008300 0.003116 0.325302 0.267566 -0.035809
repaglinide 0.025466 -0.004777 0.045565 -0.005440 -0.003481 -0.002759 -0.003732 0.034985 0.032986 0.010220 0.010438 0.005662 0.019283 0.001026 0.007820 0.011936 0.002242 0.003082 0.005636 0.033225 -0.015106 0.022541 -0.001074 1.000000 -0.003246 -0.003466 -0.007518 -0.000511 -0.015927 -0.024160 -0.001617 0.019393 0.009031 0.011257 0.018066 -0.000886 -0.002287 0.006058 -0.004506 -0.001534 -0.000511 -0.000511 0.071294 0.066174 0.014286
nateglinide -0.004170 -0.005390 0.020363 0.010707 -0.008099 -0.008790 -0.019612 0.003320 0.014676 0.006590 -0.008292 -0.002359 0.023352 0.002719 0.005489 -0.006284 -0.000440 -0.000322 0.003922 0.012336 -0.016794 -0.000669 0.020372 -0.003246 1.000000 -0.002386 0.004488 -0.000352 -0.018191 -0.020817 -0.001113 0.025830 0.013947 -0.004585 0.018302 -0.000610 -0.001575 0.001396 -0.006775 -0.001056 -0.000352 -0.000352 0.052927 0.045552 0.007164
chlorpropamide 0.006801 0.006481 0.012367 -0.000839 0.007875 0.018525 0.002666 0.004094 -0.022046 0.002161 -0.005659 0.004757 -0.000940 -0.004402 -0.004218 -0.008317 -0.002017 -0.004128 -0.007445 -0.014080 0.008938 -0.003225 -0.011841 -0.003466 -0.002386 1.000000 -0.006537 -0.000121 -0.010745 -0.005823 -0.000383 -0.007959 -0.000123 -0.001576 -0.000581 -0.000210 -0.000541 -0.020008 -0.002329 -0.000363 -0.000121 -0.000121 -0.007035 0.015661 -0.002806
glimepiride 0.008261 -0.000156 0.044360 0.013694 -0.003178 -0.022360 -0.026685 0.016086 0.038055 0.012798 0.005344 0.007223 0.045223 -0.009039 0.003318 -0.016545 0.000410 0.006773 -0.010677 0.013640 -0.031840 0.022787 0.047475 -0.007518 0.004488 -0.006537 1.000000 -0.000964 -0.071983 -0.067334 -0.003050 0.042601 0.038655 0.018418 0.019830 0.009191 -0.004314 0.012479 -0.012202 -0.002894 -0.000964 -0.000964 0.138970 0.124797 0.004760
acetohexamide 0.001793 -0.003935 0.002400 -0.000736 -0.002988 0.014597 0.001296 0.013596 -0.004231 -0.002891 0.005344 0.006615 0.009348 -0.001242 -0.000907 -0.002118 0.001038 0.002264 0.000333 0.003449 -0.000902 -0.001747 -0.002068 -0.000511 -0.000352 -0.000121 -0.000964 1.000000 -0.001585 -0.001414 -0.000056 -0.001174 -0.001085 -0.000233 -0.000086 -0.000031 -0.000080 0.003607 -0.000344 -0.000054 -0.000018 -0.000018 0.004554 0.002311 0.002639
glipizide 0.018551 0.026810 0.055867 0.017062 0.007991 -0.013379 0.009300 0.016737 0.005875 0.007273 0.012450 0.004999 0.056985 0.010527 -0.003426 -0.022736 0.010541 0.004223 -0.005554 -0.005975 0.005931 0.020844 0.077111 -0.015927 -0.018191 -0.010745 -0.071983 -0.001585 1.000000 -0.104495 -0.005014 0.049752 0.041498 0.030598 0.002971 -0.002746 -0.001524 -0.027179 -0.027923 -0.000607 -0.001585 -0.001585 0.194260 0.205145 0.014766
glyburide 0.015784 0.034631 0.076798 0.008707 -0.002804 0.048256 0.004919 0.023482 -0.047599 -0.005929 -0.001768 0.001531 0.030886 -0.000482 -0.027870 -0.036659 0.017872 0.010435 -0.005157 -0.024247 0.000373 0.009977 0.129061 -0.024160 -0.020817 -0.005823 -0.067334 -0.001414 -0.104495 1.000000 -0.004473 0.027727 0.030766 0.015094 -0.000056 -0.002450 -0.006327 -0.071853 -0.006909 0.000245 -0.001414 -0.001414 0.172392 0.183024 -0.004492
tolbutamide -0.001455 -0.001727 0.010110 -0.002326 0.006347 0.003228 0.001791 0.001799 -0.002662 -0.001944 -0.001320 -0.003423 0.002943 0.000350 -0.002870 -0.003545 0.006039 0.000339 -0.002879 0.001220 0.006437 -0.005526 -0.006539 -0.001617 -0.001113 -0.000383 -0.003050 -0.000056 -0.005014 -0.004473 1.000000 -0.003714 -0.003432 -0.000736 -0.000271 -0.000098 -0.000253 -0.001925 -0.001087 -0.000169 -0.000056 -0.000056 0.001000 0.007308 -0.007263
pioglitazone 0.026105 0.002339 0.013860 0.026059 0.018570 -0.014116 -0.005729 0.008521 0.034867 0.002210 -0.015599 0.016471 0.071584 0.012212 -0.001978 -0.026804 0.024890 0.000030 -0.008180 0.002278 -0.014531 0.000223 0.060566 0.019393 0.025830 -0.007959 0.042601 -0.001174 0.049752 0.027727 -0.003714 1.000000 -0.062763 0.015377 0.000791 -0.002034 -0.001659 0.003954 0.022117 0.007190 -0.001174 0.014894 0.203180 0.151949 0.011002
rosiglitazone 0.005938 0.010843 0.003034 0.004232 0.022930 -0.001694 -0.008894 0.008531 -0.008782 0.016639 -0.010260 0.018742 0.052860 -0.001550 -0.006844 -0.021471 0.010041 -0.010618 -0.003303 -0.011524 -0.009275 0.009548 0.097708 0.009031 0.013947 -0.000123 0.038655 -0.001085 0.041498 0.030766 -0.003432 -0.062763 1.000000 0.002006 0.003416 0.008079 -0.000996 0.004080 0.003340 -0.003256 -0.001085 -0.001085 0.191641 0.140410 0.005522
acarbose 0.013237 0.010581 0.008092 0.010411 0.006061 0.006779 -0.000753 0.007231 -0.002629 -0.005808 -0.000654 -0.000362 0.017947 0.009388 0.004224 0.000411 0.003061 0.000704 0.000458 0.007741 0.005479 0.009374 0.006246 0.011257 -0.004585 -0.001576 0.018418 -0.000233 0.030598 0.015094 -0.000736 0.015377 0.002006 1.000000 -0.001117 -0.000403 -0.001040 -0.001790 0.013046 -0.000698 -0.000233 -0.000233 0.047261 0.030097 0.007816
miglitol -0.001307 0.009920 0.011788 -0.003532 -0.001414 0.005779 -0.000763 0.005083 0.011455 0.002816 -0.002963 -0.001521 0.006422 -0.002243 -0.000207 -0.003851 0.006030 0.005705 -0.000319 -0.000293 -0.004328 0.007741 0.005628 0.018066 0.018302 -0.000581 0.019830 -0.000086 0.002971 -0.000056 -0.000271 0.000791 0.003416 -0.001117 1.000000 -0.000148 -0.000383 0.000451 -0.001650 -0.000257 -0.000086 -0.000086 0.018472 0.011094 0.003413
troglitazone 0.003106 0.007860 -0.001978 -0.001274 0.003307 0.008684 0.002245 0.004746 -0.007329 -0.002660 0.005036 -0.005745 0.002992 -0.002152 -0.001572 -0.003669 0.000456 0.001300 -0.000620 0.004710 -0.001562 -0.003026 -0.003582 -0.000886 -0.000610 -0.000210 0.009191 -0.000031 -0.002746 -0.002450 -0.000098 -0.002034 0.008079 -0.000403 -0.000148 1.000000 -0.000138 -0.000391 -0.000595 -0.000093 -0.000031 -0.000031 0.007888 0.004002 0.001009
tolazamide 0.003990 0.003242 0.003605 0.000692 0.010291 0.013139 0.001834 0.000328 -0.015677 -0.004689 0.000008 0.005154 -0.002113 -0.005556 -0.004059 -0.003526 0.000745 -0.003710 -0.003145 -0.013444 -0.004032 -0.000390 0.004664 -0.002287 -0.001575 -0.000541 -0.004314 -0.000080 -0.001524 -0.006327 -0.000253 -0.001659 -0.000996 -0.001040 -0.000383 -0.000138 1.000000 -0.013867 -0.001537 -0.000240 -0.000080 -0.000080 -0.002376 0.010336 -0.007513
insulin -0.039862 0.000247 -0.079078 -0.076697 -0.025368 -0.041842 0.005094 0.101223 0.115265 -0.014342 0.085401 0.015020 0.198963 0.010029 0.048501 0.060505 -0.075260 -0.007776 0.013942 0.076730 0.000884 0.107227 -0.017392 0.006058 0.001396 -0.020008 0.012479 0.003607 -0.027179 -0.071853 -0.001925 0.003954 0.004080 -0.001790 0.000451 -0.000391 -0.013867 1.000000 0.005828 -0.000677 0.003607 0.003607 0.461502 0.525169 0.040750
glyburide.metformin 0.006384 0.002489 -0.002451 -0.014159 -0.000573 -0.002994 -0.024616 -0.006358 0.055730 0.000051 -0.010852 -0.000553 0.013382 -0.008428 0.001956 -0.008426 0.015281 -0.007621 -0.000101 -0.005894 -0.014296 -0.005008 -0.021191 -0.004506 -0.006775 -0.002329 -0.012202 -0.000344 -0.027923 -0.006909 -0.001087 0.022117 0.003340 0.013046 -0.001650 -0.000595 -0.001537 0.005828 1.000000 0.050992 -0.000344 -0.000344 0.038712 0.044474 -0.001842
glipizide.metformin 0.005380 0.007965 0.003658 -0.002207 -0.005046 0.000933 -0.000281 -0.001692 0.010871 -0.006234 -0.006685 -0.006640 0.002757 0.003037 -0.002723 -0.000813 0.004664 0.005659 0.006007 -0.006428 -0.002705 0.001082 -0.002748 -0.001534 -0.001056 -0.000363 -0.002894 -0.000054 -0.000607 0.000245 -0.000169 0.007190 -0.003256 -0.000698 -0.000257 -0.000093 -0.000240 -0.000677 0.050992 1.000000 -0.000054 -0.000054 0.010838 0.006933 0.001747
metformin.rosiglitazone -0.011726 0.004538 0.002400 -0.000736 -0.002988 -0.002174 0.001296 -0.003396 0.009326 -0.002891 0.001689 -0.003317 -0.002603 -0.001242 -0.000907 0.001207 -0.003029 -0.000006 0.006032 0.003449 -0.000902 -0.001747 0.008300 -0.000511 -0.000352 -0.000121 -0.000964 -0.000018 -0.001585 -0.001414 -0.000056 -0.001174 -0.001085 -0.000233 -0.000086 -0.000031 -0.000080 0.003607 -0.000344 -0.000054 1.000000 -0.000018 0.004554 0.002311 -0.003530
metformin.pioglitazone 0.001793 -0.003935 -0.000257 -0.000736 0.002888 -0.000576 -0.004958 0.002268 -0.000358 0.010115 -0.004330 -0.000834 0.002074 -0.001242 -0.000907 -0.002118 0.003943 -0.005116 -0.004330 -0.007491 -0.000902 -0.001747 0.003116 -0.000511 -0.000352 -0.000121 -0.000964 -0.000018 -0.001585 -0.001414 -0.000056 0.014894 -0.001085 -0.000233 -0.000086 -0.000031 -0.000080 0.003607 -0.000344 -0.000054 -0.000018 1.000000 0.004554 0.002311 -0.003530
change 0.008300 0.012476 -0.037793 -0.041219 0.003992 -0.014047 0.002583 0.112359 0.121010 -0.005111 0.062801 0.005976 0.248529 0.027105 0.041797 0.025420 -0.033688 -0.006439 0.005824 0.055250 0.008958 0.105614 0.325302 0.071294 0.052927 -0.007035 0.138970 0.004554 0.194260 0.172392 0.001000 0.203180 0.191641 0.047261 0.018472 0.007888 -0.002376 0.461502 0.038712 0.010838 0.004554 0.004554 1.000000 0.507411 0.046717
diabetesMed -0.004537 0.015391 -0.025360 -0.030585 -0.003930 -0.029452 0.000535 0.059464 0.077597 -0.002299 0.030903 -0.009904 0.186247 0.017340 0.029415 0.025559 -0.028985 -0.010210 -0.007452 0.019375 -0.005206 0.086291 0.267566 0.066174 0.045552 0.015661 0.124797 0.002311 0.205145 0.183024 0.007308 0.151949 0.140410 0.030097 0.011094 0.004002 0.010336 0.525169 0.044474 0.006933 0.002311 0.002311 0.507411 1.000000 0.058183
readmitted 0.014912 -0.013626 0.029704 0.027236 -0.008561 0.009300 0.030377 0.057129 0.004353 -0.044800 0.035997 -0.037714 0.050711 0.068145 0.103321 0.233149 -0.004994 0.011850 0.027877 0.103885 0.017684 -0.013614 -0.035809 0.014286 0.007164 -0.002806 0.004760 0.002639 0.014766 -0.004492 -0.007263 0.011002 0.005522 0.007816 0.003413 0.001009 -0.007513 0.040750 -0.001842 0.001747 -0.003530 -0.003530 0.046717 0.058183 1.000000

Preliminary possibilities correlated with readmitted

  • number_emergency = 0.103321
  • number_inpatient = 0.233149
  • number_diagnoses = 0.103885

No change from previous correlation analysis.

Strongest +'s:

  • #ER = 0.103
  • #Inpatient = 0.233
  • #Diag's = 0.104

Strongest -'s:

  • MED_SPEC_NUM: -0.045 --> which is irrelevant at this point, since the specialty codes are alphabetically sorted
  • #Procedures: -0.038
  • metformin: -0.036
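
The ranking above was read off the full correlation matrix by eye. A minimal sketch of how the same list could be pulled out programmatically is shown below. The helper name `rank_by_correlation` and the toy DataFrame are made up for illustration and are not part of the project data.

```python
import pandas as pd

def rank_by_correlation(frame, target="readmitted"):
    """Pearson correlation of every column with `target`, sorted high to low.

    Hypothetical helper for illustration only.
    """
    corr = frame.corr()[target].drop(target)  # drop the trivial self-correlation of 1.0
    return corr.sort_values(ascending=False)

# Tiny made-up frame, NOT the project data
toy = pd.DataFrame({
    "number_inpatient": [0, 1, 2, 3, 4, 5],
    "num_procedures":   [5, 4, 4, 2, 1, 0],
    "readmitted":       [0, 0, 1, 1, 2, 2],
})

ranked = rank_by_correlation(toy)
print(ranked.head(3))  # strongest positive correlations
print(ranked.tail(3))  # strongest negative correlations
```

On the real data, `rank_by_correlation(df)` would presumably give the same leaders as the notes above, but that is an expectation rather than a verified result.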
In [304]:
# heatmap for correlation
plt.figure(figsize=(35,36))
sns.heatmap(df.corr(), annot=True)
Out[304]:
[Figure: annotated heatmap of the df.corr() correlation matrix]
In [305]:
# describe for a single column
df['readmitted'].describe()
Out[305]:
count    56000.000000
mean         0.572268
std          0.685018
min          0.000000
25%          0.000000
50%          0.000000
75%          1.000000
max          2.000000
Name: readmitted, dtype: float64
In [306]:
# how many rows there are for each unique value in the 'readmitted' column
df.groupby('readmitted').size()
Out[306]:
readmitted
0    30238
1    19477
2     6285
dtype: int64
In [307]:
# how many missing values in each column or variable
df.isnull().sum()
Out[307]:
race                        0
gender                      0
age                         0
weight                      0
admission_type_id           0
discharge_disposition_id    0
admission_source_id         0
time_in_hospital            0
payer_code                  0
MED_SPEC_NUM                0
num_lab_procedures          0
num_procedures              0
num_medications             0
number_outpatient           0
number_emergency            0
number_inpatient            0
DIAG_CAT_1                  0
DIAG_CAT_2                  0
DIAG_CAT_3                  0
number_diagnoses            0
max_glu_serum               0
A1Cresult                   0
metformin                   0
repaglinide                 0
nateglinide                 0
chlorpropamide              0
glimepiride                 0
acetohexamide               0
glipizide                   0
glyburide                   0
tolbutamide                 0
pioglitazone                0
rosiglitazone               0
acarbose                    0
miglitol                    0
troglitazone                0
tolazamide                  0
insulin                     0
glyburide.metformin         0
glipizide.metformin         0
metformin.rosiglitazone     0
metformin.pioglitazone      0
change                      0
diabetesMed                 0
readmitted                  0
dtype: int64
In [308]:
# pivot table for 'readmitted'
df.groupby(['readmitted']).count()
Out[308]:
race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code MED_SPEC_NUM num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient DIAG_CAT_1 DIAG_CAT_2 DIAG_CAT_3 number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide insulin glyburide.metformin glipizide.metformin metformin.rosiglitazone metformin.pioglitazone change diabetesMed
readmitted
0 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238 30238
1 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477 19477
2 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285
In [309]:
# pivot table for 'readmitted' showing mean value, not count
df.groupby(['readmitted']).mean()
Out[309]:
race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code MED_SPEC_NUM num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient DIAG_CAT_1 DIAG_CAT_2 DIAG_CAT_3 number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide insulin glyburide.metformin glipizide.metformin metformin.rosiglitazone metformin.pioglitazone change diabetesMed
readmitted
0 2.589060 0.472088 6.050665 0.098717 2.025994 3.818606 5.613665 4.254415 4.325617 11.423771 42.454957 1.404061 15.639725 0.272042 0.105496 0.382036 14.233018 11.918612 11.127522 7.220914 0.085191 0.376315 0.422912 0.026324 0.012666 0.001654 0.099345 0.000000 0.243336 0.211820 0.000529 0.137245 0.119783 0.004928 0.000628 0.000066 0.000926 1.017792 0.013030 0.000265 0.000066 0.000066 0.438587 0.744857
1 2.614930 0.454177 6.146121 0.164091 2.012887 3.324383 5.958053 4.508703 4.467834 9.700056 43.880269 1.250655 16.344458 0.495456 0.294039 0.845356 14.223700 12.124352 11.623351 7.658366 0.098475 0.367613 0.381732 0.034091 0.016122 0.002310 0.108230 0.000103 0.270370 0.210197 0.000205 0.162448 0.139241 0.008472 0.001284 0.000205 0.000616 1.097808 0.014376 0.000411 0.000000 0.000000 0.491605 0.799096
2 2.624821 0.459666 6.164041 0.120923 1.985521 4.487828 5.820366 4.747176 4.274781 10.037232 44.156563 1.272076 16.748449 0.428640 0.335402 1.218457 14.086396 12.104694 11.639300 7.672554 0.105489 0.332538 0.336356 0.034208 0.014638 0.000318 0.100239 0.000000 0.261098 0.201273 0.000000 0.138584 0.113286 0.005410 0.000636 0.000000 0.000000 1.135561 0.010501 0.000318 0.000000 0.000000 0.488942 0.799204

Mean 'readmitted' Pivot Table Notes:

  • Doesn't appear to be a factor:
    • race
    • gender
    • admission_type_id / admission
    • payer_code
  • Slight possibility of being a factor:
    • age
    • num_lab_procedures
    • num_procedures (interesting that those not returning at all had the highest average number of procedures at 1.404)
    • num_medications
    • number_diagnoses
    • insulin
    • change
    • diabetesMed
  • Appears to be a possible factor:
    • weight
    • discharge_disposition_id
    • time_in_hospital
    • number_outpatient
    • number_emergency
    • number_inpatient
    • max_glu_serum
    • A1Cresult (negative factor)
    • metformin (neg)
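
The grouping above was done by eyeballing how much each column's mean moves across the three 'readmitted' classes. One possible way to make that comparison less subjective is sketched below: it scales the gap between the largest and smallest class mean by the column's overall mean. The helper name `mean_spread` and the toy data are assumptions for illustration, and the measure itself is just one reasonable choice, not the method used for the notes.

```python
import pandas as pd

def mean_spread(frame, target="readmitted"):
    """Relative spread of per-class means for every column (hypothetical helper).

    spread = (largest class mean - smallest class mean) / |overall mean|
    Columns whose overall mean is 0 would give inf/NaN and need separate handling.
    """
    class_means = frame.groupby(target).mean()
    overall = frame.drop(columns=target).mean()
    spread = (class_means.max() - class_means.min()) / overall.abs()
    return spread.sort_values(ascending=False)

# Tiny made-up frame, NOT the project data
toy = pd.DataFrame({
    "gender":           [0, 1, 0, 1, 0, 1],
    "number_inpatient": [0, 0, 1, 1, 3, 3],
    "readmitted":       [0, 0, 1, 1, 2, 2],
})

spread = mean_spread(toy)
print(spread)
```

A column such as `gender`, whose class means are identical in the toy data, gets a spread of 0, while a column whose mean climbs with the class lands at the top.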
In [310]:
#histograms for all factors
df.hist(figsize=(16,16))
Out[310]:
[Figure: 7 x 7 grid of histograms, one panel per column of df]

Since the goal is to find out what causes a readmission within 30 days ('readmitted' < 30 days), the patients who were not readmitted and those readmitted after more than 30 days need to be combined into a single class.
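
As a compact sketch of that recoding, assuming the numeric coding used in this notebook (0 = NO, 1 = >30, 2 = <30), the binary target can be built in one step with a boolean comparison. The toy DataFrame is made up for illustration.

```python
import pandas as pd

# Tiny made-up frame, NOT the project data; codes assumed: 0 = NO, 1 = >30, 2 = <30
toy = pd.DataFrame({"readmitted": [1, 0, 0, 1, 2, 0, 2]})

# 1 only where the patient was readmitted in under 30 days, 0 otherwise
toy["readm2"] = (toy["readmitted"] == 2).astype(int)

print(toy)
print(toy["readm2"].value_counts())
```

This gives the same 0/1 column as mapping the codes to labels and then back to numbers, without the intermediate string step.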

In [311]:
#create 2nd 'readmitted' column
df['readm2'] = df['readmitted']
df.head(12)
Out[311]:
race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code MED_SPEC_NUM num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient DIAG_CAT_1 DIAG_CAT_2 DIAG_CAT_3 number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide insulin glyburide.metformin glipizide.metformin metformin.rosiglitazone metformin.pioglitazone change diabetesMed readmitted readm2
0 3 0 8 0 3 1 4 5 0 0 39 3 11 0 0 0 10 4 18 7 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1
1 3 0 7 0 5 3 1 6 7 0 79 1 25 3 0 0 16 13 16 9 0 0 2 0 0 0 0 0 0 2 0 0 0 0 0 0 0 3 0 0 0 0 1 1 0 0
2 5 0 6 0 1 22 7 4 14 18 29 2 18 0 0 1 24 18 2 9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 1 0 0
3 3 1 7 0 1 1 7 3 0 18 72 3 18 0 0 0 17 4 3 9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 1 1 1
4 3 0 3 0 2 1 1 3 0 0 21 1 6 0 0 0 23 18 32 9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 3 1 7 0 2 1 1 2 0 18 4 0 7 0 0 0 14 9 3 8 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 3 0 4 0 1 1 7 6 14 33 89 0 25 0 2 1 25 10 16 9 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 1 1 0 0
7 3 0 7 0 1 6 7 4 6 0 63 0 22 0 2 4 16 3 3 5 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1
8 3 1 6 0 1 1 7 6 7 0 45 0 24 0 0 0 16 9 3 7 0 3 2 0 0 0 0 0 0 0 0 2 0 0 0 0 0 1 0 0 0 0 1 1 0 0
9 3 1 8 0 1 1 7 2 3 0 45 0 13 0 0 0 17 3 3 9 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0
10 3 1 6 0 2 1 1 3 7 0 57 6 21 0 0 0 12 10 4 9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 1 0 0
11 3 0 4 0 6 1 17 6 0 18 81 0 26 0 0 0 13 21 21 9 0 3 0 0 0 0 0 0 2 0 0 2 0 0 0 0 0 0 0 0 0 0 1 1 2 2

NB: Using 12 rows because the 12th row (row 11) is the first instance of a patient being readmitted < 30 days
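
The row count of 12 was found by inspection. A small sketch of how the first '<30' row could be located instead is shown below, assuming a default 0-based integer index and the code 2 for '<30'. The toy data is made up for illustration.

```python
import pandas as pd

# Tiny made-up frame, NOT the project data; code 2 is assumed to mean '<30'
toy = pd.DataFrame({"readmitted": [1, 0, 0, 2, 0, 2]})

mask = toy["readmitted"] == 2
# idxmax() returns the label of the first True; guard against "no match at all"
first_idx = mask.idxmax() if mask.any() else None

# With a default 0-based index, showing first_idx + 1 rows includes that row
n_rows = first_idx + 1
print(toy.head(n_rows))
```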

In [312]:
# map the numeric codes in the 'readm2' column back to their labels:
# 0 -> 'NO'
# 1 -> '>30'
# 2 -> '<30'

df = df.replace({'readm2': {0: 'NO', 1: '>30', 2: '<30'}})

df.head(12)
Out[312]:
race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code MED_SPEC_NUM num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient DIAG_CAT_1 DIAG_CAT_2 DIAG_CAT_3 number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide insulin glyburide.metformin glipizide.metformin metformin.rosiglitazone metformin.pioglitazone change diabetesMed readmitted readm2
0 3 0 8 0 3 1 4 5 0 0 39 3 11 0 0 0 10 4 18 7 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 >30
1 3 0 7 0 5 3 1 6 7 0 79 1 25 3 0 0 16 13 16 9 0 0 2 0 0 0 0 0 0 2 0 0 0 0 0 0 0 3 0 0 0 0 1 1 0 NO
2 5 0 6 0 1 22 7 4 14 18 29 2 18 0 0 1 24 18 2 9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 1 0 NO
3 3 1 7 0 1 1 7 3 0 18 72 3 18 0 0 0 17 4 3 9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 1 1 >30
4 3 0 3 0 2 1 1 3 0 0 21 1 6 0 0 0 23 18 32 9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NO
5 3 1 7 0 2 1 1 2 0 18 4 0 7 0 0 0 14 9 3 8 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NO
6 3 0 4 0 1 1 7 6 14 33 89 0 25 0 2 1 25 10 16 9 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 1 1 0 NO
7 3 0 7 0 1 6 7 4 6 0 63 0 22 0 2 4 16 3 3 5 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 >30
8 3 1 6 0 1 1 7 6 7 0 45 0 24 0 0 0 16 9 3 7 0 3 2 0 0 0 0 0 0 0 0 2 0 0 0 0 0 1 0 0 0 0 1 1 0 NO
9 3 1 8 0 1 1 7 2 3 0 45 0 13 0 0 0 17 3 3 9 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 NO
10 3 1 6 0 2 1 1 3 7 0 57 6 21 0 0 0 12 10 4 9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 1 0 NO
11 3 0 4 0 6 1 17 6 0 18 81 0 26 0 0 0 13 21 21 9 0 3 0 0 0 0 0 0 2 0 0 2 0 0 0 0 0 0 0 0 0 0 1 1 2 <30
In [313]:
# collapse 'readm2' to a binary target (1 = readmitted in under 30 days):
# 'NO'  -> 0
# '>30' -> 0
# '<30' -> 1

df = df.replace({'readm2': {'NO': 0, '>30': 0, '<30': 1}})

df.head(12)
Out[313]:
race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code MED_SPEC_NUM num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient DIAG_CAT_1 DIAG_CAT_2 DIAG_CAT_3 number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide insulin glyburide.metformin glipizide.metformin metformin.rosiglitazone metformin.pioglitazone change diabetesMed readmitted readm2
0 3 0 8 0 3 1 4 5 0 0 39 3 11 0 0 0 10 4 18 7 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0
1 3 0 7 0 5 3 1 6 7 0 79 1 25 3 0 0 16 13 16 9 0 0 2 0 0 0 0 0 0 2 0 0 0 0 0 0 0 3 0 0 0 0 1 1 0 0
2 5 0 6 0 1 22 7 4 14 18 29 2 18 0 0 1 24 18 2 9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 1 0 0
3 3 1 7 0 1 1 7 3 0 18 72 3 18 0 0 0 17 4 3 9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 1 1 0
4 3 0 3 0 2 1 1 3 0 0 21 1 6 0 0 0 23 18 32 9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 3 1 7 0 2 1 1 2 0 18 4 0 7 0 0 0 14 9 3 8 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 3 0 4 0 1 1 7 6 14 33 89 0 25 0 2 1 25 10 16 9 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 1 1 0 0
7 3 0 7 0 1 6 7 4 6 0 63 0 22 0 2 4 16 3 3 5 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0
8 3 1 6 0 1 1 7 6 7 0 45 0 24 0 0 0 16 9 3 7 0 3 2 0 0 0 0 0 0 0 0 2 0 0 0 0 0 1 0 0 0 0 1 1 0 0
9 3 1 8 0 1 1 7 2 3 0 45 0 13 0 0 0 17 3 3 9 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0
10 3 1 6 0 2 1 1 3 7 0 57 6 21 0 0 0 12 10 4 9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 1 0 0
11 3 0 4 0 6 1 17 6 0 18 81 0 26 0 0 0 13 21 21 9 0 3 0 0 0 0 0 0 2 0 0 2 0 0 0 0 0 0 0 0 0 0 1 1 2 1
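
The two cells above reach the binary target by going from codes to labels and then from labels to 0/1. A minimal sketch of a one-step equivalent follows. It assumes, as the cell comments state, that code 2 means "<30"; the `toy` frame is made-up data used only for illustration and is not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the 'readm2' column (0 = NO, 1 = >30, 2 = <30); not the real data.
toy = pd.DataFrame({'readm2': [1, 0, 0, 1, 0, 2]})

# Flag only the "<30" code (2) as the positive class; everything else becomes 0.
binary_target = (toy['readm2'] == 2).astype(int)

print(binary_target.tolist())
# Sanity check: the number of positives equals the number of original 2s.
print(int(binary_target.sum()), int((toy['readm2'] == 2).sum()))
```

Comparing the count of positives with the count of the original code is a cheap way to confirm that no rows were misclassified during the recode.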
In [314]:
# drop readmitted - will use readm2 as Y instead
df = df.drop('readmitted', axis=1)
df.head(12)
Out[314]:
race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code MED_SPEC_NUM num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient DIAG_CAT_1 DIAG_CAT_2 DIAG_CAT_3 number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide insulin glyburide.metformin glipizide.metformin metformin.rosiglitazone metformin.pioglitazone change diabetesMed readm2
0 3 0 8 0 3 1 4 5 0 0 39 3 11 0 0 0 10 4 18 7 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
1 3 0 7 0 5 3 1 6 7 0 79 1 25 3 0 0 16 13 16 9 0 0 2 0 0 0 0 0 0 2 0 0 0 0 0 0 0 3 0 0 0 0 1 1 0
2 5 0 6 0 1 22 7 4 14 18 29 2 18 0 0 1 24 18 2 9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 1 0
3 3 1 7 0 1 1 7 3 0 18 72 3 18 0 0 0 17 4 3 9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 1 0
4 3 0 3 0 2 1 1 3 0 0 21 1 6 0 0 0 23 18 32 9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 3 1 7 0 2 1 1 2 0 18 4 0 7 0 0 0 14 9 3 8 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 3 0 4 0 1 1 7 6 14 33 89 0 25 0 2 1 25 10 16 9 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 1 1 0
7 3 0 7 0 1 6 7 4 6 0 63 0 22 0 2 4 16 3 3 5 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
8 3 1 6 0 1 1 7 6 7 0 45 0 24 0 0 0 16 9 3 7 0 3 2 0 0 0 0 0 0 0 0 2 0 0 0 0 0 1 0 0 0 0 1 1 0
9 3 1 8 0 1 1 7 2 3 0 45 0 13 0 0 0 17 3 3 9 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0
10 3 1 6 0 2 1 1 3 7 0 57 6 21 0 0 0 12 10 4 9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 1 0
11 3 0 4 0 6 1 17 6 0 18 81 0 26 0 0 0 13 21 21 9 0 3 0 0 0 0 0 0 2 0 0 2 0 0 0 0 0 0 0 0 0 0 1 1 1
In [315]:
#basic statistics
df.describe()
Out[315]:
race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code MED_SPEC_NUM num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient DIAG_CAT_1 DIAG_CAT_2 DIAG_CAT_3 number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide insulin glyburide.metformin glipizide.metformin metformin.rosiglitazone metformin.pioglitazone change diabetesMed readm2
count 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000
mean 2.602071 0.464464 6.096589 0.123946 2.016893 3.721821 5.756643 4.398161 4.369375 10.668643 43.141661 1.335893 16.009268 0.367321 0.196875 0.637054 14.213321 12.011054 11.357411 7.423750 0.092089 0.368375 0.398875 0.029911 0.014089 0.001732 0.102536 0.000036 0.254732 0.210071 0.000357 0.146161 0.125821 0.006214 0.000857 0.000107 0.000714 1.058839 0.013214 0.000321 0.000036 0.000036 0.462679 0.769821 0.112232
std 0.937754 0.498740 1.590761 0.712004 1.438340 5.291517 4.053838 2.984346 4.363828 15.595799 19.656507 1.702009 8.132455 1.249570 0.916820 1.270768 7.272908 7.443902 8.157131 1.931488 0.431655 0.890972 0.815169 0.247161 0.169132 0.060480 0.449274 0.008452 0.678992 0.627625 0.026724 0.525985 0.490002 0.112904 0.042249 0.014638 0.037790 1.102484 0.162472 0.025353 0.008452 0.008452 0.498610 0.420951 0.315655
min 0.000000 0.000000 0.000000 0.000000 1.000000 1.000000 1.000000 1.000000 0.000000 0.000000 1.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 3.000000 0.000000 5.000000 0.000000 1.000000 1.000000 1.000000 2.000000 0.000000 0.000000 32.000000 0.000000 10.000000 0.000000 0.000000 0.000000 10.000000 4.000000 3.000000 6.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000
50% 3.000000 0.000000 6.000000 0.000000 1.000000 1.000000 7.000000 4.000000 6.000000 4.000000 44.000000 1.000000 15.000000 0.000000 0.000000 0.000000 15.000000 12.000000 10.000000 8.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000
75% 3.000000 1.000000 7.000000 0.000000 3.000000 4.000000 7.000000 6.000000 7.000000 18.000000 57.000000 2.000000 20.000000 0.000000 0.000000 1.000000 18.000000 17.000000 17.000000 9.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 2.000000 0.000000 0.000000 0.000000 0.000000 1.000000 1.000000 0.000000
max 5.000000 1.000000 9.000000 9.000000 8.000000 28.000000 25.000000 14.000000 16.000000 63.000000 132.000000 6.000000 75.000000 42.000000 76.000000 21.000000 32.000000 32.000000 32.000000 16.000000 3.000000 3.000000 3.000000 3.000000 3.000000 3.000000 3.000000 2.000000 3.000000 3.000000 2.000000 3.000000 3.000000 3.000000 3.000000 2.000000 2.000000 3.000000 3.000000 2.000000 2.000000 2.000000 1.000000 1.000000 1.000000
In [316]:
# describe for a single column
df['readm2'].describe()
Out[316]:
count    56000.000000
mean         0.112232
std          0.315655
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max          1.000000
Name: readm2, dtype: float64
In [317]:
# number of rows for each value in the 'readm2' column
df.groupby('readm2').size()
Out[317]:
readm2
0    49715
1     6285
dtype: int64
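
The counts above show that the target is imbalanced. As a rough worked example, the arithmetic below derives the positive rate (which should match the mean of 0.112232 reported by describe()) and the accuracy a model would get by always predicting 0. The variable names are illustrative only.

```python
# Class counts taken from the groupby output above.
n_negative = 49715  # readm2 == 0 (not readmitted, or readmitted after 30 days)
n_positive = 6285   # readm2 == 1 (readmitted in under 30 days)
n_total = n_negative + n_positive

positive_rate = n_positive / n_total
majority_baseline = n_negative / n_total  # accuracy of always predicting 0

print(f"total rows: {n_total}")
print(f"positive rate: {positive_rate:.6f}")
print(f"majority-class baseline accuracy: {majority_baseline:.6f}")
```

Any classifier built later needs to beat roughly 88.8% accuracy to improve on the trivial "never readmitted" prediction, so accuracy alone is a weak yardstick for this target.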
In [318]:
# pivot table for 'readm2' (row counts per class)
df.groupby(['readm2']).count()
Out[318]:
race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code MED_SPEC_NUM num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient DIAG_CAT_1 DIAG_CAT_2 DIAG_CAT_3 number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide insulin glyburide.metformin glipizide.metformin metformin.rosiglitazone metformin.pioglitazone change diabetesMed
readm2
0 49715 49715 49715 49715 49715 49715 49715 49715 49715 49715 49715 49715 49715 49715 49715 49715 49715 49715 49715 49715 49715 49715 49715 49715 49715 49715 49715 49715 49715 49715 49715 49715 49715 49715 49715 49715 49715 49715 49715 49715 49715 49715 49715 49715
1 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285 6285
In [319]:
# pivot table for 'readm2' showing mean value, not count
df.groupby(['readm2']).mean()
Out[319]:
race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code MED_SPEC_NUM num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient DIAG_CAT_1 DIAG_CAT_2 DIAG_CAT_3 number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide insulin glyburide.metformin glipizide.metformin metformin.rosiglitazone metformin.pioglitazone change diabetesMed
readm2
0 2.599195 0.465071 6.088062 0.124329 2.020859 3.624982 5.748587 4.354038 4.381334 10.748466 43.013356 1.343961 15.915820 0.35957 0.179362 0.563552 14.229367 11.999216 11.321774 7.392296 0.090395 0.372906 0.406779 0.029367 0.014020 0.001911 0.102826 0.00004 0.253927 0.211184 0.000402 0.147119 0.127406 0.006316 0.000885 0.000121 0.000805 1.049140 0.013557 0.000322 0.00004 0.00004 0.459358 0.766107
1 2.624821 0.459666 6.164041 0.120923 1.985521 4.487828 5.820366 4.747176 4.274781 10.037232 44.156563 1.272076 16.748449 0.42864 0.335402 1.218457 14.086396 12.104694 11.639300 7.672554 0.105489 0.332538 0.336356 0.034208 0.014638 0.000318 0.100239 0.00000 0.261098 0.201273 0.000000 0.138584 0.113286 0.005410 0.000636 0.000000 0.000000 1.135561 0.010501 0.000318 0.00000 0.00000 0.488942 0.799204
In [320]:
# pairwise correlation matrix (Pearson, the df.corr() default)
df.corr()
Out[320]:
race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code MED_SPEC_NUM num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient DIAG_CAT_1 DIAG_CAT_2 DIAG_CAT_3 number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide insulin glyburide.metformin glipizide.metformin metformin.rosiglitazone metformin.pioglitazone change diabetesMed readm2
race 1.000000 0.061706 0.114255 0.040520 0.096587 0.005805 0.033113 -0.020364 0.041640 -0.030777 -0.023193 0.024391 0.022157 0.050845 -0.012812 -0.006053 0.042924 0.029594 0.016000 0.081672 0.054576 -0.013318 0.010548 0.025466 -0.004170 0.006801 0.008261 0.001793 0.018551 0.015784 -0.001455 0.026105 0.005938 0.013237 -0.001307 0.003106 0.003990 -0.039862 0.006384 0.005380 -0.011726 0.001793 0.008300 -0.004537 0.008626
gender 0.061706 1.000000 -0.048579 0.014491 0.014578 -0.019566 -0.005222 -0.031088 0.000833 0.016623 -0.004968 0.061668 -0.023819 -0.005846 -0.024202 -0.013405 -0.034311 0.008083 0.008343 -0.007818 -0.001347 0.016539 0.001549 -0.004777 -0.005390 0.006481 -0.000156 -0.003935 0.026810 0.034631 -0.001727 0.002339 0.010843 0.010581 0.009920 0.007860 0.003242 0.000247 0.002489 0.007965 0.004538 -0.003935 0.012476 0.015391 -0.003421
age 0.114255 -0.048579 1.000000 0.005716 -0.005747 0.113970 0.041070 0.107273 0.058032 -0.068202 0.025665 -0.028360 0.039010 0.029064 -0.089149 -0.047012 0.091837 0.077541 0.052021 0.243515 0.018618 -0.147559 -0.060696 0.045565 0.020363 0.012367 0.044360 0.002400 0.055867 0.076798 0.010110 0.013860 0.003034 0.008092 0.011788 -0.001978 0.003605 -0.079078 -0.002451 0.003658 0.002400 -0.000257 -0.037793 -0.025360 0.015077
weight 0.040520 0.014491 0.005716 1.000000 0.037503 -0.035383 0.003026 0.023652 0.047819 0.004630 0.090456 0.018693 0.011274 0.104440 0.003706 -0.009154 0.023982 0.031824 0.014000 0.054391 -0.037139 -0.021109 0.007304 -0.005440 0.010707 -0.000839 0.013694 -0.000736 0.017062 0.008707 -0.002326 0.026059 0.004232 0.010411 -0.003532 -0.001274 0.000692 -0.076697 -0.014159 -0.002207 -0.000736 -0.000736 -0.041219 -0.030585 -0.001510
admission_type_id 0.096587 0.014578 -0.005747 0.037503 1.000000 0.085986 0.098007 -0.014285 -0.136863 0.185351 -0.145869 0.131923 0.075711 0.030746 -0.018190 -0.032648 0.032151 -0.005648 -0.008918 -0.113991 0.352793 -0.043929 0.008631 -0.003481 -0.008099 0.007875 -0.003178 -0.002988 0.007991 -0.002804 0.006347 0.018570 0.022930 0.006061 -0.001414 0.003307 0.010291 -0.025368 -0.000573 -0.005046 -0.002988 0.002888 0.003992 -0.003930 -0.007755
discharge_disposition_id 0.005805 -0.019566 0.113970 -0.035383 0.085986 1.000000 0.016614 0.161954 -0.123220 -0.024028 0.022906 0.015536 0.105415 -0.006101 -0.024692 0.019240 0.034616 0.029774 0.024778 0.049496 0.037086 -0.020713 -0.008376 -0.002759 -0.008790 0.018525 -0.022360 0.014597 -0.013379 0.048256 0.003228 -0.014116 -0.001694 0.006779 0.005779 0.008684 0.013139 -0.041842 -0.002994 0.000933 -0.002174 -0.000576 -0.014047 -0.029452 0.051471
admission_source_id 0.033113 -0.005222 0.041070 0.003026 0.098007 0.016614 1.000000 -0.006996 -0.100157 -0.152760 0.046823 -0.137044 -0.055016 0.028833 0.061938 0.033697 -0.007753 -0.019796 0.001447 0.076318 0.412356 0.006512 -0.033283 -0.003732 -0.019612 0.002666 -0.026685 0.001296 0.009300 0.004919 0.001791 -0.005729 -0.008894 -0.000753 -0.000763 0.002245 0.001834 0.005094 -0.024616 -0.000281 0.001296 -0.004958 0.002583 0.000535 0.005589
time_in_hospital -0.020364 -0.031088 0.107273 0.023652 -0.014285 0.161954 -0.006996 1.000000 -0.037805 0.023146 0.318234 0.193139 0.468752 -0.003410 -0.005467 0.079929 -0.019913 0.086503 0.068677 0.224265 0.029079 0.058088 -0.009071 0.034985 0.003320 0.004094 0.016086 0.013596 0.016737 0.023482 0.001799 0.008521 0.008531 0.007231 0.005083 0.004746 0.000328 0.101223 -0.006358 -0.001692 -0.003396 0.002268 0.112359 0.059464 0.041582
payer_code 0.041640 0.000833 0.058032 0.047819 -0.136863 -0.123220 -0.100157 -0.037805 1.000000 -0.082746 -0.049680 -0.047581 0.005658 0.062572 0.067316 0.009598 0.008458 0.036335 0.033135 0.076424 -0.095739 -0.006824 0.027596 0.032986 0.014676 -0.022046 0.038055 -0.004231 0.005875 -0.047599 -0.002662 0.034867 -0.008782 -0.002629 0.011455 -0.007329 -0.015677 0.115265 0.055730 0.010871 0.009326 -0.000358 0.121010 0.077597 -0.007707
MED_SPEC_NUM -0.030777 0.016623 -0.068202 0.004630 0.185351 -0.024028 -0.152760 0.023146 -0.082746 1.000000 -0.068863 0.076952 0.036943 -0.051445 -0.009879 -0.013909 0.018820 -0.019354 -0.015192 -0.176693 -0.003316 -0.009813 0.023068 0.010220 0.006590 0.002161 0.012798 -0.002891 0.007273 -0.005929 -0.001944 0.002210 0.016639 -0.005808 0.002816 -0.002660 -0.004689 -0.014342 0.000051 -0.006234 -0.002891 0.010115 -0.005111 -0.002299 -0.014395
num_lab_procedures -0.023193 -0.004968 0.025665 0.090456 -0.145869 0.022906 0.046823 0.318234 -0.049680 -0.068863 1.000000 0.055081 0.267707 -0.008437 0.000613 0.037763 -0.071046 0.011204 0.011021 0.149116 -0.124907 0.236383 -0.044042 0.010438 -0.008292 -0.005659 0.005344 0.005344 0.012450 -0.001768 -0.001320 -0.015599 -0.010260 -0.000654 -0.002963 0.005036 0.000008 0.085401 -0.010852 -0.006685 0.001689 -0.004330 0.062801 0.030903 0.018358
num_procedures 0.024391 0.061668 -0.028360 0.018693 0.131923 0.015536 -0.137044 0.193139 -0.047581 0.076952 0.055081 1.000000 0.387685 -0.028257 -0.033659 -0.061114 -0.056866 0.036607 0.025920 0.074394 -0.069910 -0.017477 -0.038122 0.005662 -0.002359 0.004757 0.007223 0.006615 0.004999 0.001531 -0.003423 0.016471 0.018742 -0.000362 -0.001521 -0.005745 0.005154 0.015020 -0.000553 -0.006640 -0.003317 -0.000834 0.005976 -0.009904 -0.013332
num_medications 0.022157 -0.023819 0.039010 0.011274 0.075711 0.105415 -0.055016 0.468752 0.005658 0.036943 0.267707 0.387685 1.000000 0.047313 0.017129 0.066793 0.004288 0.084268 0.063166 0.263311 0.001639 0.013044 0.069433 0.019283 0.023352 -0.000940 0.045223 0.009348 0.056985 0.030886 0.002943 0.071584 0.052860 0.017947 0.006422 0.002992 -0.002113 0.198963 0.013382 0.002757 -0.002603 0.002074 0.248529 0.186247 0.032318
number_outpatient 0.050845 -0.005846 0.029064 0.104440 0.030746 -0.006101 0.028833 -0.003410 0.062572 -0.051445 -0.008437 -0.028257 0.047313 1.000000 0.087824 0.103471 -0.009347 0.028015 0.026595 0.093518 0.054949 -0.024324 -0.013006 0.001026 0.002719 -0.004402 -0.009039 -0.001242 0.010527 -0.000482 0.000350 0.012212 -0.001550 0.009388 -0.002243 -0.002152 -0.005556 0.010029 -0.008428 0.003037 -0.001242 -0.001242 0.027105 0.017340 0.017448
number_emergency -0.012812 -0.024202 -0.089149 0.003706 -0.018190 -0.024692 0.061938 -0.005467 0.067316 -0.009879 0.000613 -0.033659 0.017129 0.087824 1.000000 0.279626 -0.023803 -0.004155 0.007427 0.059398 0.035679 -0.004270 -0.009572 0.007820 0.005489 -0.004218 0.003318 -0.000907 -0.003426 -0.027870 -0.002870 -0.001978 -0.006844 0.004224 -0.000207 -0.001572 -0.004059 0.048501 0.001956 -0.002723 -0.000907 -0.000907 0.041797 0.029415 0.053723
number_inpatient -0.006053 -0.013405 -0.047012 -0.009154 -0.032648 0.019240 0.033697 0.079929 0.009598 -0.013909 0.037763 -0.061114 0.066793 0.103471 0.279626 1.000000 -0.004620 0.024244 0.032150 0.102473 0.038503 -0.049379 -0.073780 0.011936 -0.006284 -0.008317 -0.016545 -0.002118 -0.022736 -0.036659 -0.003545 -0.026804 -0.021471 0.000411 -0.003851 -0.003669 -0.003526 0.060505 -0.008426 -0.000813 0.001207 -0.002118 0.025420 0.025559 0.162676
DIAG_CAT_1 0.042924 -0.034311 0.091837 0.023982 0.032151 0.034616 -0.007753 -0.019913 0.008458 0.018820 -0.071046 -0.056866 0.004288 -0.009347 -0.023803 -0.004620 1.000000 0.025858 0.028021 0.046451 -0.016030 -0.091392 0.033199 0.002242 -0.000440 -0.002017 0.000410 0.001038 0.010541 0.017872 0.006039 0.024890 0.010041 0.003061 0.006030 0.000456 0.000745 -0.075260 0.015281 0.004664 -0.003029 0.003943 -0.033688 -0.028985 -0.006205
DIAG_CAT_2 0.029594 0.008083 0.077541 0.031824 -0.005648 0.029774 -0.019796 0.086503 0.036335 -0.019354 0.011204 0.036607 0.084268 0.028015 -0.004155 0.024244 0.025858 1.000000 0.081391 0.171521 -0.017962 -0.044930 -0.018313 0.003082 -0.000322 -0.004128 0.006773 0.002264 0.004223 0.010435 0.000339 0.000030 -0.010618 0.000704 0.005705 0.001300 -0.003710 -0.007776 -0.007621 0.005659 -0.000006 -0.005116 -0.006439 -0.010210 0.004473
DIAG_CAT_3 0.016000 0.008343 0.052021 0.014000 -0.008918 0.024778 0.001447 0.068677 0.033135 -0.015192 0.011021 0.025920 0.063166 0.026595 0.007427 0.032150 0.028021 0.081391 1.000000 0.186667 -0.009693 -0.031716 -0.024179 0.005636 0.003922 -0.007445 -0.010677 0.000333 -0.005554 -0.005157 -0.002879 -0.008180 -0.003303 0.000458 -0.000319 -0.000620 -0.003145 0.013942 -0.000101 0.006007 0.006032 -0.004330 0.005824 -0.007452 0.012287
number_diagnoses 0.081672 -0.007818 0.243515 0.054391 -0.113991 0.049496 0.076318 0.224265 0.076424 -0.176693 0.149116 0.074394 0.263311 0.093518 0.059398 0.102473 0.046451 0.171521 0.186667 1.000000 -0.036161 -0.032983 -0.073736 0.033225 0.012336 -0.014080 0.013640 0.003449 -0.005975 -0.024247 0.001220 0.002278 -0.011524 0.007741 -0.000293 0.004710 -0.013444 0.076730 -0.005894 -0.006428 0.003449 -0.007491 0.055250 0.019375 0.045801
max_glu_serum 0.054576 -0.001347 0.018618 -0.037139 0.352793 0.037086 0.412356 0.029079 -0.095739 -0.003316 -0.124907 -0.069910 0.001639 0.054949 0.035679 0.038503 -0.016030 -0.017962 -0.009693 -0.036161 1.000000 -0.043540 -0.029790 -0.015106 -0.016794 0.008938 -0.031840 -0.000902 0.005931 0.000373 0.006437 -0.014531 -0.009275 0.005479 -0.004328 -0.001562 -0.004032 0.000884 -0.014296 -0.002705 -0.000902 -0.000902 0.008958 -0.005206 0.011038
A1Cresult -0.013318 0.016539 -0.147559 -0.021109 -0.043929 -0.020713 0.006512 0.058088 -0.006824 -0.009813 0.236383 -0.017477 0.013044 -0.024324 -0.004270 -0.049379 -0.091392 -0.044930 -0.031716 -0.032983 -0.043540 1.000000 0.051894 0.022541 -0.000669 -0.003225 0.022787 -0.001747 0.020844 0.009977 -0.005526 0.000223 0.009548 0.009374 0.007741 -0.003026 -0.000390 0.107227 -0.005008 0.001082 -0.001747 -0.001747 0.105614 0.086291 -0.014302
metformin 0.010548 0.001549 -0.060696 0.007304 0.008631 -0.008376 -0.033283 -0.009071 0.027596 0.023068 -0.044042 -0.038122 0.069433 -0.013006 -0.009572 -0.073780 0.033199 -0.018313 -0.024179 -0.073736 -0.029790 0.051894 1.000000 -0.001074 0.020372 -0.011841 0.047475 -0.002068 0.077111 0.129061 -0.006539 0.060566 0.097708 0.006246 0.005628 -0.003582 0.004664 -0.017392 -0.021191 -0.002748 0.008300 0.003116 0.325302 0.267566 -0.027269
repaglinide 0.025466 -0.004777 0.045565 -0.005440 -0.003481 -0.002759 -0.003732 0.034985 0.032986 0.010220 0.010438 0.005662 0.019283 0.001026 0.007820 0.011936 0.002242 0.003082 0.005636 0.033225 -0.015106 0.022541 -0.001074 1.000000 -0.003246 -0.003466 -0.007518 -0.000511 -0.015927 -0.024160 -0.001617 0.019393 0.009031 0.011257 0.018066 -0.000886 -0.002287 0.006058 -0.004506 -0.001534 -0.000511 -0.000511 0.071294 0.066174 0.006183
nateglinide -0.004170 -0.005390 0.020363 0.010707 -0.008099 -0.008790 -0.019612 0.003320 0.014676 0.006590 -0.008292 -0.002359 0.023352 0.002719 0.005489 -0.006284 -0.000440 -0.000322 0.003922 0.012336 -0.016794 -0.000669 0.020372 -0.003246 1.000000 -0.002386 0.004488 -0.000352 -0.018191 -0.020817 -0.001113 0.025830 0.013947 -0.004585 0.018302 -0.000610 -0.001575 0.001396 -0.006775 -0.001056 -0.000352 -0.000352 0.052927 0.045552 0.001154
chlorpropamide 0.006801 0.006481 0.012367 -0.000839 0.007875 0.018525 0.002666 0.004094 -0.022046 0.002161 -0.005659 0.004757 -0.000940 -0.004402 -0.004218 -0.008317 -0.002017 -0.004128 -0.007445 -0.014080 0.008938 -0.003225 -0.011841 -0.003466 -0.002386 1.000000 -0.006537 -0.000121 -0.010745 -0.005823 -0.000383 -0.007959 -0.000123 -0.001576 -0.000581 -0.000210 -0.000541 -0.020008 -0.002329 -0.000363 -0.000121 -0.000121 -0.007035 0.015661 -0.008312
glimepiride 0.008261 -0.000156 0.044360 0.013694 -0.003178 -0.022360 -0.026685 0.016086 0.038055 0.012798 0.005344 0.007223 0.045223 -0.009039 0.003318 -0.016545 0.000410 0.006773 -0.010677 0.013640 -0.031840 0.022787 0.047475 -0.007518 0.004488 -0.006537 1.000000 -0.000964 -0.071983 -0.067334 -0.003050 0.042601 0.038655 0.018418 0.019830 0.009191 -0.004314 0.012479 -0.012202 -0.002894 -0.000964 -0.000964 0.138970 0.124797 -0.001818
acetohexamide 0.001793 -0.003935 0.002400 -0.000736 -0.002988 0.014597 0.001296 0.013596 -0.004231 -0.002891 0.005344 0.006615 0.009348 -0.001242 -0.000907 -0.002118 0.001038 0.002264 0.000333 0.003449 -0.000902 -0.001747 -0.002068 -0.000511 -0.000352 -0.000121 -0.000964 1.000000 -0.001585 -0.001414 -0.000056 -0.001174 -0.001085 -0.000233 -0.000086 -0.000031 -0.000080 0.003607 -0.000344 -0.000054 -0.000018 -0.000018 0.004554 0.002311 -0.001503
glipizide 0.018551 0.026810 0.055867 0.017062 0.007991 -0.013379 0.009300 0.016737 0.005875 0.007273 0.012450 0.004999 0.056985 0.010527 -0.003426 -0.022736 0.010541 0.004223 -0.005554 -0.005975 0.005931 0.020844 0.077111 -0.015927 -0.018191 -0.010745 -0.071983 -0.001585 1.000000 -0.104495 -0.005014 0.049752 0.041498 0.030598 0.002971 -0.002746 -0.001524 -0.027179 -0.027923 -0.000607 -0.001585 -0.001585 0.194260 0.205145 0.003333
glyburide 0.015784 0.034631 0.076798 0.008707 -0.002804 0.048256 0.004919 0.023482 -0.047599 -0.005929 -0.001768 0.001531 0.030886 -0.000482 -0.027870 -0.036659 0.017872 0.010435 -0.005157 -0.024247 0.000373 0.009977 0.129061 -0.024160 -0.020817 -0.005823 -0.067334 -0.001414 -0.104495 1.000000 -0.004473 0.027727 0.030766 0.015094 -0.000056 -0.002450 -0.006327 -0.071853 -0.006909 0.000245 -0.001414 -0.001414 0.172392 0.183024 -0.004985
tolbutamide -0.001455 -0.001727 0.010110 -0.002326 0.006347 0.003228 0.001791 0.001799 -0.002662 -0.001944 -0.001320 -0.003423 0.002943 0.000350 -0.002870 -0.003545 0.006039 0.000339 -0.002879 0.001220 0.006437 -0.005526 -0.006539 -0.001617 -0.001113 -0.000383 -0.003050 -0.000056 -0.005014 -0.004473 1.000000 -0.003714 -0.003432 -0.000736 -0.000271 -0.000098 -0.000253 -0.001925 -0.001087 -0.000169 -0.000056 -0.000056 0.001000 0.007308 -0.004752
pioglitazone 0.026105 0.002339 0.013860 0.026059 0.018570 -0.014116 -0.005729 0.008521 0.034867 0.002210 -0.015599 0.016471 0.071584 0.012212 -0.001978 -0.026804 0.024890 0.000030 -0.008180 0.002278 -0.014531 0.000223 0.060566 0.019393 0.025830 -0.007959 0.042601 -0.001174 0.049752 0.027727 -0.003714 1.000000 -0.062763 0.015377 0.000791 -0.002034 -0.001659 0.003954 0.022117 0.007190 -0.001174 0.014894 0.203180 0.151949 -0.005122
rosiglitazone 0.005938 0.010843 0.003034 0.004232 0.022930 -0.001694 -0.008894 0.008531 -0.008782 0.016639 -0.010260 0.018742 0.052860 -0.001550 -0.006844 -0.021471 0.010041 -0.010618 -0.003303 -0.011524 -0.009275 0.009548 0.097708 0.009031 0.013947 -0.000123 0.038655 -0.001085 0.041498 0.030766 -0.003432 -0.062763 1.000000 0.002006 0.003416 0.008079 -0.000996 0.004080 0.003340 -0.003256 -0.001085 -0.001085 0.191641 0.140410 -0.009096
acarbose 0.013237 0.010581 0.008092 0.010411 0.006061 0.006779 -0.000753 0.007231 -0.002629 -0.005808 -0.000654 -0.000362 0.017947 0.009388 0.004224 0.000411 0.003061 0.000704 0.000458 0.007741 0.005479 0.009374 0.006246 0.011257 -0.004585 -0.001576 0.018418 -0.000233 0.030598 0.015094 -0.000736 0.015377 0.002006 1.000000 -0.001117 -0.000403 -0.001040 -0.001790 0.013046 -0.000698 -0.000233 -0.000233 0.047261 0.030097 -0.002534
miglitol -0.001307 0.009920 0.011788 -0.003532 -0.001414 0.005779 -0.000763 0.005083 0.011455 0.002816 -0.002963 -0.001521 0.006422 -0.002243 -0.000207 -0.003851 0.006030 0.005705 -0.000319 -0.000293 -0.004328 0.007741 0.005628 0.018066 0.018302 -0.000581 0.019830 -0.000086 0.002971 -0.000056 -0.000271 0.000791 0.003416 -0.001117 1.000000 -0.000148 -0.000383 0.000451 -0.001650 -0.000257 -0.000086 -0.000086 0.018472 0.011094 -0.001857
troglitazone 0.003106 0.007860 -0.001978 -0.001274 0.003307 0.008684 0.002245 0.004746 -0.007329 -0.002660 0.005036 -0.005745 0.002992 -0.002152 -0.001572 -0.003669 0.000456 0.001300 -0.000620 0.004710 -0.001562 -0.003026 -0.003582 -0.000886 -0.000610 -0.000210 0.009191 -0.000031 -0.002746 -0.002450 -0.000098 -0.002034 0.008079 -0.000403 -0.000148 1.000000 -0.000138 -0.000391 -0.000595 -0.000093 -0.000031 -0.000031 0.007888 0.004002 -0.002602
tolazamide 0.003990 0.003242 0.003605 0.000692 0.010291 0.013139 0.001834 0.000328 -0.015677 -0.004689 0.000008 0.005154 -0.002113 -0.005556 -0.004059 -0.003526 0.000745 -0.003710 -0.003145 -0.013444 -0.004032 -0.000390 0.004664 -0.002287 -0.001575 -0.000541 -0.004314 -0.000080 -0.001524 -0.006327 -0.000253 -0.001659 -0.000996 -0.001040 -0.000383 -0.000138 1.000000 -0.013867 -0.001537 -0.000240 -0.000080 -0.000080 -0.002376 0.010336 -0.006721
insulin -0.039862 0.000247 -0.079078 -0.076697 -0.025368 -0.041842 0.005094 0.101223 0.115265 -0.014342 0.085401 0.015020 0.198963 0.010029 0.048501 0.060505 -0.075260 -0.007776 0.013942 0.076730 0.000884 0.107227 -0.017392 0.006058 0.001396 -0.020008 0.012479 0.003607 -0.027179 -0.071853 -0.001925 0.003954 0.004080 -0.001790 0.000451 -0.000391 -0.013867 1.000000 0.005828 -0.000677 0.003607 0.003607 0.461502 0.525169 0.024743
glyburide.metformin 0.006384 0.002489 -0.002451 -0.014159 -0.000573 -0.002994 -0.024616 -0.006358 0.055730 0.000051 -0.010852 -0.000553 0.013382 -0.008428 0.001956 -0.008426 0.015281 -0.007621 -0.000101 -0.005894 -0.014296 -0.005008 -0.021191 -0.004506 -0.006775 -0.002329 -0.012202 -0.000344 -0.027923 -0.006909 -0.001087 0.022117 0.003340 0.013046 -0.001650 -0.000595 -0.001537 0.005828 1.000000 0.050992 -0.000344 -0.000344 0.038712 0.044474 -0.005937
glipizide.metformin 0.005380 0.007965 0.003658 -0.002207 -0.005046 0.000933 -0.000281 -0.001692 0.010871 -0.006234 -0.006685 -0.006640 0.002757 0.003037 -0.002723 -0.000813 0.004664 0.005659 0.006007 -0.006428 -0.002705 0.001082 -0.002748 -0.001534 -0.001056 -0.000363 -0.002894 -0.000054 -0.000607 0.000245 -0.000169 0.007190 -0.003256 -0.000698 -0.000257 -0.000093 -0.000240 -0.000677 0.050992 1.000000 -0.000054 -0.000054 0.010838 0.006933 -0.000045
metformin.rosiglitazone -0.011726 0.004538 0.002400 -0.000736 -0.002988 -0.002174 0.001296 -0.003396 0.009326 -0.002891 0.001689 -0.003317 -0.002603 -0.001242 -0.000907 0.001207 -0.003029 -0.000006 0.006032 0.003449 -0.000902 -0.001747 0.008300 -0.000511 -0.000352 -0.000121 -0.000964 -0.000018 -0.001585 -0.001414 -0.000056 -0.001174 -0.001085 -0.000233 -0.000086 -0.000031 -0.000080 0.003607 -0.000344 -0.000054 1.000000 -0.000018 0.004554 0.002311 -0.001503
metformin.pioglitazone 0.001793 -0.003935 -0.000257 -0.000736 0.002888 -0.000576 -0.004958 0.002268 -0.000358 0.010115 -0.004330 -0.000834 0.002074 -0.001242 -0.000907 -0.002118 0.003943 -0.005116 -0.004330 -0.007491 -0.000902 -0.001747 0.003116 -0.000511 -0.000352 -0.000121 -0.000964 -0.000018 -0.001585 -0.001414 -0.000056 0.014894 -0.001085 -0.000233 -0.000086 -0.000031 -0.000080 0.003607 -0.000344 -0.000054 -0.000018 1.000000 0.004554 0.002311 -0.001503
change 0.008300 0.012476 -0.037793 -0.041219 0.003992 -0.014047 0.002583 0.112359 0.121010 -0.005111 0.062801 0.005976 0.248529 0.027105 0.041797 0.025420 -0.033688 -0.006439 0.005824 0.055250 0.008958 0.105614 0.325302 0.071294 0.052927 -0.007035 0.138970 0.004554 0.194260 0.172392 0.001000 0.203180 0.191641 0.047261 0.018472 0.007888 -0.002376 0.461502 0.038712 0.010838 0.004554 0.004554 1.000000 0.507411 0.018728
diabetesMed -0.004537 0.015391 -0.025360 -0.030585 -0.003930 -0.029452 0.000535 0.059464 0.077597 -0.002299 0.030903 -0.009904 0.186247 0.017340 0.029415 0.025559 -0.028985 -0.010210 -0.007452 0.019375 -0.005206 0.086291 0.267566 0.066174 0.045552 0.015661 0.124797 0.002311 0.205145 0.183024 0.007308 0.151949 0.140410 0.030097 0.011094 0.004002 0.010336 0.525169 0.044474 0.006933 0.002311 0.002311 0.507411 1.000000 0.024819
readm2 0.008626 -0.003421 0.015077 -0.001510 -0.007755 0.051471 0.005589 0.041582 -0.007707 -0.014395 0.018358 -0.013332 0.032318 0.017448 0.053723 0.162676 -0.006205 0.004473 0.012287 0.045801 0.011038 -0.014302 -0.027269 0.006183 0.001154 -0.008312 -0.001818 -0.001503 0.003333 -0.004985 -0.004752 -0.005122 -0.009096 -0.002534 -0.001857 -0.002602 -0.006721 0.024743 -0.005937 -0.000045 -0.001503 -0.001503 0.018728 0.024819 1.000000

Preliminary candidates: correlations with readm2 have changed compared with readmitted

  • number_emergency = 0.103321 ==> No longer shows a notable correlation; now 0.053723
  • number_inpatient = 0.233149 ==> Now the only variable with a correlation above 0.1, at 0.162676
  • number_diagnoses = 0.103885 ==> No longer shows a notable correlation; now 0.045801
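
The comparison above amounts to ranking features by the absolute value of their correlation with readm2. The sketch below does this in plain Python with a handful of values copied from the readm2 row of the correlation table. The 0.1 cutoff and the name CUTOFF are assumptions for illustration, not a threshold defined in this analysis. With the full DataFrame, df.corr()['readm2'] would supply the same numbers.

```python
# Correlations with readm2, copied from the readm2 row of df.corr() above
corr_with_readm2 = {
    "number_inpatient": 0.162676,
    "number_emergency": 0.053723,
    "discharge_disposition_id": 0.051471,
    "number_diagnoses": 0.045801,
    "time_in_hospital": 0.041582,
    "num_medications": 0.032318,
    "metformin": -0.027269,
}

CUTOFF = 0.1  # hypothetical cutoff for a "notable" correlation (assumption)

# Rank by absolute correlation, strongest first
ranked = sorted(corr_with_readm2.items(), key=lambda kv: abs(kv[1]), reverse=True)

for name, r in ranked:
    label = "notable" if abs(r) >= CUTOFF else "weak"
    print(f"{name:<26} {r:+.6f}  {label}")

# Features whose absolute correlation reaches the cutoff
notable = [name for name, r in ranked if abs(r) >= CUTOFF]
```

Ranking on the absolute value keeps negatively correlated features such as metformin in the comparison instead of pushing them to the bottom.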
In [321]:
# heatmap for correlation
plt.figure(figsize=(35,36))
sns.heatmap(df.corr(), annot=True)
Out[321]:
<matplotlib.axes._subplots.AxesSubplot at 0x484f2c88>
In [387]:
# input model building slide here
Image('Images/christensen_finalprj/Slide10.png')
Out[387]:

Classification Model building.

  • Need to narrow down the number of factors in order to focus on the most significant ones

ExtraTreesClassifier

In [323]:
# Set Y and X

y = df['readm2']
X = df.drop(['readm2'], axis=1)
X.head(12)
Out[323]:
race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code MED_SPEC_NUM num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient DIAG_CAT_1 DIAG_CAT_2 DIAG_CAT_3 number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide insulin glyburide.metformin glipizide.metformin metformin.rosiglitazone metformin.pioglitazone change diabetesMed
0 3 0 8 0 3 1 4 5 0 0 39 3 11 0 0 0 10 4 18 7 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 1
1 3 0 7 0 5 3 1 6 7 0 79 1 25 3 0 0 16 13 16 9 0 0 2 0 0 0 0 0 0 2 0 0 0 0 0 0 0 3 0 0 0 0 1 1
2 5 0 6 0 1 22 7 4 14 18 29 2 18 0 0 1 24 18 2 9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 1
3 3 1 7 0 1 1 7 3 0 18 72 3 18 0 0 0 17 4 3 9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 1
4 3 0 3 0 2 1 1 3 0 0 21 1 6 0 0 0 23 18 32 9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 3 1 7 0 2 1 1 2 0 18 4 0 7 0 0 0 14 9 3 8 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 3 0 4 0 1 1 7 6 14 33 89 0 25 0 2 1 25 10 16 9 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 1 1
7 3 0 7 0 1 6 7 4 6 0 63 0 22 0 2 4 16 3 3 5 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
8 3 1 6 0 1 1 7 6 7 0 45 0 24 0 0 0 16 9 3 7 0 3 2 0 0 0 0 0 0 0 0 2 0 0 0 0 0 1 0 0 0 0 1 1
9 3 1 8 0 1 1 7 2 3 0 45 0 13 0 0 0 17 3 3 9 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1
10 3 1 6 0 2 1 1 3 7 0 57 6 21 0 0 0 12 10 4 9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 1
11 3 0 4 0 6 1 17 6 0 18 81 0 26 0 0 0 13 21 21 9 0 3 0 0 0 0 0 0 2 0 0 2 0 0 0 0 0 0 0 0 0 0 1 1
In [324]:
# build logisticRegression

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

lr = LogisticRegression()
lr.fit(X_train, y_train)
Out[324]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
In [325]:
# build ExtraTreesClassifier

model_extra = ExtraTreesClassifier()
model_extra.fit(X, y)
model_extra.score(X, y)

# display the relative importance of each attribute
print(model_extra.feature_importances_)
[  2.91794982e-02   2.53679647e-02   5.37078564e-02   7.50331122e-03
   3.35746117e-02   4.57729444e-02   2.97595433e-02   5.66506897e-02
   4.27513308e-02   4.18174543e-02   6.59548839e-02   4.37842852e-02
   6.46074445e-02   2.93875786e-02   2.45889522e-02   5.46483089e-02
   5.79017039e-02   5.91816193e-02   5.91892575e-02   4.36703730e-02
   7.38929730e-03   1.52840833e-02   1.17254508e-02   4.00583871e-03
   2.24838652e-03   8.73138518e-05   8.64359359e-03   1.86232908e-06
   1.34819531e-02   1.19863156e-02   2.40050594e-05   9.38387138e-03
   7.51014406e-03   1.08885851e-03   1.67479741e-04   4.64248888e-06
   3.14279024e-05   2.30591406e-02   1.41216087e-03   8.83976179e-05
   0.00000000e+00   3.74683575e-07   9.02443469e-03   4.35135523e-03]
In [326]:
# What are the highest ranking X variables according to ExtraTreesClassifier?

print("Features sorted by their rank:")
print(sorted(zip(map(lambda x: round(x, 4), model_extra.feature_importances_), X.columns)))
Features sorted by their rank:
[(0.0, 'acetohexamide'), (0.0, 'metformin.pioglitazone'), (0.0, 'metformin.rosiglitazone'), (0.0, 'tolazamide'), (0.0, 'tolbutamide'), (0.0, 'troglitazone'), (0.0001, 'chlorpropamide'), (0.0001, 'glipizide.metformin'), (0.00020000000000000001, 'miglitol'), (0.0011000000000000001, 'acarbose'), (0.0014, 'glyburide.metformin'), (0.0022000000000000001, 'nateglinide'), (0.0040000000000000001, 'repaglinide'), (0.0044000000000000003, 'diabetesMed'), (0.0074000000000000003, 'max_glu_serum'), (0.0074999999999999997, 'rosiglitazone'), (0.0074999999999999997, 'weight'), (0.0086, 'glimepiride'), (0.0089999999999999993, 'change'), (0.0094000000000000004, 'pioglitazone'), (0.0117, 'metformin'), (0.012, 'glyburide'), (0.0135, 'glipizide'), (0.015299999999999999, 'A1Cresult'), (0.023099999999999999, 'insulin'), (0.0246, 'number_emergency'), (0.025399999999999999, 'gender'), (0.0292, 'race'), (0.029399999999999999, 'number_outpatient'), (0.0298, 'admission_source_id'), (0.033599999999999998, 'admission_type_id'), (0.041799999999999997, 'MED_SPEC_NUM'), (0.042799999999999998, 'payer_code'), (0.043700000000000003, 'number_diagnoses'), (0.043799999999999999, 'num_procedures'), (0.0458, 'discharge_disposition_id'), (0.053699999999999998, 'age'), (0.054600000000000003, 'number_inpatient'), (0.0567, 'time_in_hospital'), (0.0579, 'DIAG_CAT_1'), (0.059200000000000003, 'DIAG_CAT_2'), (0.059200000000000003, 'DIAG_CAT_3'), (0.064600000000000005, 'num_medications'), (0.066000000000000003, 'num_lab_procedures')]

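All factors with an importance of roughly 0.004 or less will be removed

  • 'repaglinide' at 0.0040 and everything ranked below it are dropped
  • 'diabetesMed' at 0.0044 and everything ranked above it are kept
  • ExtraTreesClassifier is randomized and no random_state is set, so these importances shift slightly between runs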
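
The cutoff rule above can be applied programmatically instead of by reading the sorted list. The sketch below uses plain Python on (importance, feature) pairs copied from the sorted output above. The list is abbreviated on the kept side, and the CUTOFF value of 0.0042 is an assumption chosen only to fall between 'repaglinide' and 'diabetesMed' in that output.

```python
# (importance, feature) pairs copied from the sorted output above,
# rounded to 4 places; only a few of the kept features are listed
importances = [
    (0.0, "acetohexamide"), (0.0, "metformin.pioglitazone"),
    (0.0, "metformin.rosiglitazone"), (0.0, "tolazamide"),
    (0.0, "tolbutamide"), (0.0, "troglitazone"),
    (0.0001, "chlorpropamide"), (0.0001, "glipizide.metformin"),
    (0.0002, "miglitol"), (0.0011, "acarbose"),
    (0.0014, "glyburide.metformin"), (0.0022, "nateglinide"),
    (0.0040, "repaglinide"), (0.0044, "diabetesMed"),
    (0.0074, "max_glu_serum"), (0.0660, "num_lab_procedures"),
]

CUTOFF = 0.0042  # hypothetical value between 'repaglinide' and 'diabetesMed'

# Split the features at the cutoff
to_drop = [name for score, name in importances if score < CUTOFF]
to_keep = [name for score, name in importances if score >= CUTOFF]

print(len(to_drop), "features to drop:", to_drop)

# With the DataFrame from this notebook, one call would remove them all:
# df = df.drop(to_drop, axis=1)
```

Building the list from the importances keeps the dropped columns in step with the model output, which matters here because the importances change between runs.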
In [327]:
# drop the columns ranked at or below 'repaglinide' by ExtraTreesClassifier and display the result
df = df.drop('acetohexamide', axis=1)
df = df.drop('metformin.pioglitazone', axis=1)
df = df.drop('metformin.rosiglitazone', axis=1)
df = df.drop('tolazamide', axis=1)

df = df.drop('tolbutamide', axis=1)
df = df.drop('troglitazone', axis=1)
df = df.drop('chlorpropamide', axis=1)
df = df.drop('glipizide.metformin', axis=1)
df = df.drop('miglitol', axis=1)
df = df.drop('acarbose', axis=1)
df = df.drop('glyburide.metformin', axis=1)
df = df.drop('nateglinide', axis=1)
df = df.drop('repaglinide', axis=1)


df.head(12)
Out[327]:
race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code MED_SPEC_NUM num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient DIAG_CAT_1 DIAG_CAT_2 DIAG_CAT_3 number_diagnoses max_glu_serum A1Cresult metformin glimepiride glipizide glyburide pioglitazone rosiglitazone insulin change diabetesMed readm2
0 3 0 8 0 3 1 4 5 0 0 39 3 11 0 0 0 10 4 18 7 0 0 0 0 0 2 0 0 0 0 1 0
1 3 0 7 0 5 3 1 6 7 0 79 1 25 3 0 0 16 13 16 9 0 0 2 0 0 2 0 0 3 1 1 0
2 5 0 6 0 1 22 7 4 14 18 29 2 18 0 0 1 24 18 2 9 0 0 0 0 0 0 0 0 2 0 1 0
3 3 1 7 0 1 1 7 3 0 18 72 3 18 0 0 0 17 4 3 9 0 0 0 0 0 0 0 0 2 0 1 0
4 3 0 3 0 2 1 1 3 0 0 21 1 6 0 0 0 23 18 32 9 0 0 0 0 0 0 0 0 0 0 0 0
5 3 1 7 0 2 1 1 2 0 18 4 0 7 0 0 0 14 9 3 8 0 0 0 0 0 0 0 0 0 0 0 0
6 3 0 4 0 1 1 7 6 14 33 89 0 25 0 2 1 25 10 16 9 0 0 2 0 0 0 0 0 2 1 1 0
7 3 0 7 0 1 6 7 4 6 0 63 0 22 0 2 4 16 3 3 5 0 0 2 0 0 0 0 0 0 0 1 0
8 3 1 6 0 1 1 7 6 7 0 45 0 24 0 0 0 16 9 3 7 0 3 2 0 0 0 2 0 1 1 1 0
9 3 1 8 0 1 1 7 2 3 0 45 0 13 0 0 0 17 3 3 9 0 0 0 0 2 0 0 0 1 1 1 0
10 3 1 6 0 2 1 1 3 7 0 57 6 21 0 0 0 12 10 4 9 0 0 0 0 0 0 0 0 2 0 1 0
11 3 0 4 0 6 1 17 6 0 18 81 0 26 0 0 0 13 21 21 9 0 3 0 0 2 0 2 0 0 1 1 1
In [328]:
#general info
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 56000 entries, 0 to 55999
Data columns (total 32 columns):
race                        56000 non-null int64
gender                      56000 non-null int64
age                         56000 non-null int64
weight                      56000 non-null int64
admission_type_id           56000 non-null int64
discharge_disposition_id    56000 non-null int64
admission_source_id         56000 non-null int64
time_in_hospital            56000 non-null int64
payer_code                  56000 non-null int64
MED_SPEC_NUM                56000 non-null int64
num_lab_procedures          56000 non-null int64
num_procedures              56000 non-null int64
num_medications             56000 non-null int64
number_outpatient           56000 non-null int64
number_emergency            56000 non-null int64
number_inpatient            56000 non-null int64
DIAG_CAT_1                  56000 non-null int64
DIAG_CAT_2                  56000 non-null int64
DIAG_CAT_3                  56000 non-null int64
number_diagnoses            56000 non-null int64
max_glu_serum               56000 non-null int64
A1Cresult                   56000 non-null int64
metformin                   56000 non-null int64
glimepiride                 56000 non-null int64
glipizide                   56000 non-null int64
glyburide                   56000 non-null int64
pioglitazone                56000 non-null int64
rosiglitazone               56000 non-null int64
insulin                     56000 non-null int64
change                      56000 non-null int64
diabetesMed                 56000 non-null int64
readm2                      56000 non-null int64
dtypes: int64(32)
memory usage: 13.7 MB
In [329]:
#basic statistics
df.describe()
Out[329]:
race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code MED_SPEC_NUM num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient DIAG_CAT_1 DIAG_CAT_2 DIAG_CAT_3 number_diagnoses max_glu_serum A1Cresult metformin glimepiride glipizide glyburide pioglitazone rosiglitazone insulin change diabetesMed readm2
count 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000 56000.000000
mean 2.602071 0.464464 6.096589 0.123946 2.016893 3.721821 5.756643 4.398161 4.369375 10.668643 43.141661 1.335893 16.009268 0.367321 0.196875 0.637054 14.213321 12.011054 11.357411 7.423750 0.092089 0.368375 0.398875 0.102536 0.254732 0.210071 0.146161 0.125821 1.058839 0.462679 0.769821 0.112232
std 0.937754 0.498740 1.590761 0.712004 1.438340 5.291517 4.053838 2.984346 4.363828 15.595799 19.656507 1.702009 8.132455 1.249570 0.916820 1.270768 7.272908 7.443902 8.157131 1.931488 0.431655 0.890972 0.815169 0.449274 0.678992 0.627625 0.525985 0.490002 1.102484 0.498610 0.420951 0.315655
min 0.000000 0.000000 0.000000 0.000000 1.000000 1.000000 1.000000 1.000000 0.000000 0.000000 1.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 3.000000 0.000000 5.000000 0.000000 1.000000 1.000000 1.000000 2.000000 0.000000 0.000000 32.000000 0.000000 10.000000 0.000000 0.000000 0.000000 10.000000 4.000000 3.000000 6.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000
50% 3.000000 0.000000 6.000000 0.000000 1.000000 1.000000 7.000000 4.000000 6.000000 4.000000 44.000000 1.000000 15.000000 0.000000 0.000000 0.000000 15.000000 12.000000 10.000000 8.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 1.000000 0.000000
75% 3.000000 1.000000 7.000000 0.000000 3.000000 4.000000 7.000000 6.000000 7.000000 18.000000 57.000000 2.000000 20.000000 0.000000 0.000000 1.000000 18.000000 17.000000 17.000000 9.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 2.000000 1.000000 1.000000 0.000000
max 5.000000 1.000000 9.000000 9.000000 8.000000 28.000000 25.000000 14.000000 16.000000 63.000000 132.000000 6.000000 75.000000 42.000000 76.000000 21.000000 32.000000 32.000000 32.000000 16.000000 3.000000 3.000000 3.000000 3.000000 3.000000 3.000000 3.000000 3.000000 3.000000 1.000000 1.000000 1.000000
In [330]:
#correlation analysis
df.corr()
Out[330]:
race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code MED_SPEC_NUM num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient DIAG_CAT_1 DIAG_CAT_2 DIAG_CAT_3 number_diagnoses max_glu_serum A1Cresult metformin glimepiride glipizide glyburide pioglitazone rosiglitazone insulin change diabetesMed readm2
race 1.000000 0.061706 0.114255 0.040520 0.096587 0.005805 0.033113 -0.020364 0.041640 -0.030777 -0.023193 0.024391 0.022157 0.050845 -0.012812 -0.006053 0.042924 0.029594 0.016000 0.081672 0.054576 -0.013318 0.010548 0.008261 0.018551 0.015784 0.026105 0.005938 -0.039862 0.008300 -0.004537 0.008626
gender 0.061706 1.000000 -0.048579 0.014491 0.014578 -0.019566 -0.005222 -0.031088 0.000833 0.016623 -0.004968 0.061668 -0.023819 -0.005846 -0.024202 -0.013405 -0.034311 0.008083 0.008343 -0.007818 -0.001347 0.016539 0.001549 -0.000156 0.026810 0.034631 0.002339 0.010843 0.000247 0.012476 0.015391 -0.003421
age 0.114255 -0.048579 1.000000 0.005716 -0.005747 0.113970 0.041070 0.107273 0.058032 -0.068202 0.025665 -0.028360 0.039010 0.029064 -0.089149 -0.047012 0.091837 0.077541 0.052021 0.243515 0.018618 -0.147559 -0.060696 0.044360 0.055867 0.076798 0.013860 0.003034 -0.079078 -0.037793 -0.025360 0.015077
weight 0.040520 0.014491 0.005716 1.000000 0.037503 -0.035383 0.003026 0.023652 0.047819 0.004630 0.090456 0.018693 0.011274 0.104440 0.003706 -0.009154 0.023982 0.031824 0.014000 0.054391 -0.037139 -0.021109 0.007304 0.013694 0.017062 0.008707 0.026059 0.004232 -0.076697 -0.041219 -0.030585 -0.001510
admission_type_id 0.096587 0.014578 -0.005747 0.037503 1.000000 0.085986 0.098007 -0.014285 -0.136863 0.185351 -0.145869 0.131923 0.075711 0.030746 -0.018190 -0.032648 0.032151 -0.005648 -0.008918 -0.113991 0.352793 -0.043929 0.008631 -0.003178 0.007991 -0.002804 0.018570 0.022930 -0.025368 0.003992 -0.003930 -0.007755
discharge_disposition_id 0.005805 -0.019566 0.113970 -0.035383 0.085986 1.000000 0.016614 0.161954 -0.123220 -0.024028 0.022906 0.015536 0.105415 -0.006101 -0.024692 0.019240 0.034616 0.029774 0.024778 0.049496 0.037086 -0.020713 -0.008376 -0.022360 -0.013379 0.048256 -0.014116 -0.001694 -0.041842 -0.014047 -0.029452 0.051471
admission_source_id 0.033113 -0.005222 0.041070 0.003026 0.098007 0.016614 1.000000 -0.006996 -0.100157 -0.152760 0.046823 -0.137044 -0.055016 0.028833 0.061938 0.033697 -0.007753 -0.019796 0.001447 0.076318 0.412356 0.006512 -0.033283 -0.026685 0.009300 0.004919 -0.005729 -0.008894 0.005094 0.002583 0.000535 0.005589
time_in_hospital -0.020364 -0.031088 0.107273 0.023652 -0.014285 0.161954 -0.006996 1.000000 -0.037805 0.023146 0.318234 0.193139 0.468752 -0.003410 -0.005467 0.079929 -0.019913 0.086503 0.068677 0.224265 0.029079 0.058088 -0.009071 0.016086 0.016737 0.023482 0.008521 0.008531 0.101223 0.112359 0.059464 0.041582
payer_code 0.041640 0.000833 0.058032 0.047819 -0.136863 -0.123220 -0.100157 -0.037805 1.000000 -0.082746 -0.049680 -0.047581 0.005658 0.062572 0.067316 0.009598 0.008458 0.036335 0.033135 0.076424 -0.095739 -0.006824 0.027596 0.038055 0.005875 -0.047599 0.034867 -0.008782 0.115265 0.121010 0.077597 -0.007707
MED_SPEC_NUM -0.030777 0.016623 -0.068202 0.004630 0.185351 -0.024028 -0.152760 0.023146 -0.082746 1.000000 -0.068863 0.076952 0.036943 -0.051445 -0.009879 -0.013909 0.018820 -0.019354 -0.015192 -0.176693 -0.003316 -0.009813 0.023068 0.012798 0.007273 -0.005929 0.002210 0.016639 -0.014342 -0.005111 -0.002299 -0.014395
num_lab_procedures -0.023193 -0.004968 0.025665 0.090456 -0.145869 0.022906 0.046823 0.318234 -0.049680 -0.068863 1.000000 0.055081 0.267707 -0.008437 0.000613 0.037763 -0.071046 0.011204 0.011021 0.149116 -0.124907 0.236383 -0.044042 0.005344 0.012450 -0.001768 -0.015599 -0.010260 0.085401 0.062801 0.030903 0.018358
num_procedures 0.024391 0.061668 -0.028360 0.018693 0.131923 0.015536 -0.137044 0.193139 -0.047581 0.076952 0.055081 1.000000 0.387685 -0.028257 -0.033659 -0.061114 -0.056866 0.036607 0.025920 0.074394 -0.069910 -0.017477 -0.038122 0.007223 0.004999 0.001531 0.016471 0.018742 0.015020 0.005976 -0.009904 -0.013332
num_medications 0.022157 -0.023819 0.039010 0.011274 0.075711 0.105415 -0.055016 0.468752 0.005658 0.036943 0.267707 0.387685 1.000000 0.047313 0.017129 0.066793 0.004288 0.084268 0.063166 0.263311 0.001639 0.013044 0.069433 0.045223 0.056985 0.030886 0.071584 0.052860 0.198963 0.248529 0.186247 0.032318
number_outpatient 0.050845 -0.005846 0.029064 0.104440 0.030746 -0.006101 0.028833 -0.003410 0.062572 -0.051445 -0.008437 -0.028257 0.047313 1.000000 0.087824 0.103471 -0.009347 0.028015 0.026595 0.093518 0.054949 -0.024324 -0.013006 -0.009039 0.010527 -0.000482 0.012212 -0.001550 0.010029 0.027105 0.017340 0.017448
number_emergency -0.012812 -0.024202 -0.089149 0.003706 -0.018190 -0.024692 0.061938 -0.005467 0.067316 -0.009879 0.000613 -0.033659 0.017129 0.087824 1.000000 0.279626 -0.023803 -0.004155 0.007427 0.059398 0.035679 -0.004270 -0.009572 0.003318 -0.003426 -0.027870 -0.001978 -0.006844 0.048501 0.041797 0.029415 0.053723
number_inpatient -0.006053 -0.013405 -0.047012 -0.009154 -0.032648 0.019240 0.033697 0.079929 0.009598 -0.013909 0.037763 -0.061114 0.066793 0.103471 0.279626 1.000000 -0.004620 0.024244 0.032150 0.102473 0.038503 -0.049379 -0.073780 -0.016545 -0.022736 -0.036659 -0.026804 -0.021471 0.060505 0.025420 0.025559 0.162676
DIAG_CAT_1 0.042924 -0.034311 0.091837 0.023982 0.032151 0.034616 -0.007753 -0.019913 0.008458 0.018820 -0.071046 -0.056866 0.004288 -0.009347 -0.023803 -0.004620 1.000000 0.025858 0.028021 0.046451 -0.016030 -0.091392 0.033199 0.000410 0.010541 0.017872 0.024890 0.010041 -0.075260 -0.033688 -0.028985 -0.006205
DIAG_CAT_2 0.029594 0.008083 0.077541 0.031824 -0.005648 0.029774 -0.019796 0.086503 0.036335 -0.019354 0.011204 0.036607 0.084268 0.028015 -0.004155 0.024244 0.025858 1.000000 0.081391 0.171521 -0.017962 -0.044930 -0.018313 0.006773 0.004223 0.010435 0.000030 -0.010618 -0.007776 -0.006439 -0.010210 0.004473
DIAG_CAT_3 0.016000 0.008343 0.052021 0.014000 -0.008918 0.024778 0.001447 0.068677 0.033135 -0.015192 0.011021 0.025920 0.063166 0.026595 0.007427 0.032150 0.028021 0.081391 1.000000 0.186667 -0.009693 -0.031716 -0.024179 -0.010677 -0.005554 -0.005157 -0.008180 -0.003303 0.013942 0.005824 -0.007452 0.012287
number_diagnoses 0.081672 -0.007818 0.243515 0.054391 -0.113991 0.049496 0.076318 0.224265 0.076424 -0.176693 0.149116 0.074394 0.263311 0.093518 0.059398 0.102473 0.046451 0.171521 0.186667 1.000000 -0.036161 -0.032983 -0.073736 0.013640 -0.005975 -0.024247 0.002278 -0.011524 0.076730 0.055250 0.019375 0.045801
max_glu_serum 0.054576 -0.001347 0.018618 -0.037139 0.352793 0.037086 0.412356 0.029079 -0.095739 -0.003316 -0.124907 -0.069910 0.001639 0.054949 0.035679 0.038503 -0.016030 -0.017962 -0.009693 -0.036161 1.000000 -0.043540 -0.029790 -0.031840 0.005931 0.000373 -0.014531 -0.009275 0.000884 0.008958 -0.005206 0.011038
A1Cresult -0.013318 0.016539 -0.147559 -0.021109 -0.043929 -0.020713 0.006512 0.058088 -0.006824 -0.009813 0.236383 -0.017477 0.013044 -0.024324 -0.004270 -0.049379 -0.091392 -0.044930 -0.031716 -0.032983 -0.043540 1.000000 0.051894 0.022787 0.020844 0.009977 0.000223 0.009548 0.107227 0.105614 0.086291 -0.014302
metformin 0.010548 0.001549 -0.060696 0.007304 0.008631 -0.008376 -0.033283 -0.009071 0.027596 0.023068 -0.044042 -0.038122 0.069433 -0.013006 -0.009572 -0.073780 0.033199 -0.018313 -0.024179 -0.073736 -0.029790 0.051894 1.000000 0.047475 0.077111 0.129061 0.060566 0.097708 -0.017392 0.325302 0.267566 -0.027269
glimepiride 0.008261 -0.000156 0.044360 0.013694 -0.003178 -0.022360 -0.026685 0.016086 0.038055 0.012798 0.005344 0.007223 0.045223 -0.009039 0.003318 -0.016545 0.000410 0.006773 -0.010677 0.013640 -0.031840 0.022787 0.047475 1.000000 -0.071983 -0.067334 0.042601 0.038655 0.012479 0.138970 0.124797 -0.001818
glipizide 0.018551 0.026810 0.055867 0.017062 0.007991 -0.013379 0.009300 0.016737 0.005875 0.007273 0.012450 0.004999 0.056985 0.010527 -0.003426 -0.022736 0.010541 0.004223 -0.005554 -0.005975 0.005931 0.020844 0.077111 -0.071983 1.000000 -0.104495 0.049752 0.041498 -0.027179 0.194260 0.205145 0.003333
glyburide 0.015784 0.034631 0.076798 0.008707 -0.002804 0.048256 0.004919 0.023482 -0.047599 -0.005929 -0.001768 0.001531 0.030886 -0.000482 -0.027870 -0.036659 0.017872 0.010435 -0.005157 -0.024247 0.000373 0.009977 0.129061 -0.067334 -0.104495 1.000000 0.027727 0.030766 -0.071853 0.172392 0.183024 -0.004985
pioglitazone 0.026105 0.002339 0.013860 0.026059 0.018570 -0.014116 -0.005729 0.008521 0.034867 0.002210 -0.015599 0.016471 0.071584 0.012212 -0.001978 -0.026804 0.024890 0.000030 -0.008180 0.002278 -0.014531 0.000223 0.060566 0.042601 0.049752 0.027727 1.000000 -0.062763 0.003954 0.203180 0.151949 -0.005122
rosiglitazone 0.005938 0.010843 0.003034 0.004232 0.022930 -0.001694 -0.008894 0.008531 -0.008782 0.016639 -0.010260 0.018742 0.052860 -0.001550 -0.006844 -0.021471 0.010041 -0.010618 -0.003303 -0.011524 -0.009275 0.009548 0.097708 0.038655 0.041498 0.030766 -0.062763 1.000000 0.004080 0.191641 0.140410 -0.009096
insulin -0.039862 0.000247 -0.079078 -0.076697 -0.025368 -0.041842 0.005094 0.101223 0.115265 -0.014342 0.085401 0.015020 0.198963 0.010029 0.048501 0.060505 -0.075260 -0.007776 0.013942 0.076730 0.000884 0.107227 -0.017392 0.012479 -0.027179 -0.071853 0.003954 0.004080 1.000000 0.461502 0.525169 0.024743
change 0.008300 0.012476 -0.037793 -0.041219 0.003992 -0.014047 0.002583 0.112359 0.121010 -0.005111 0.062801 0.005976 0.248529 0.027105 0.041797 0.025420 -0.033688 -0.006439 0.005824 0.055250 0.008958 0.105614 0.325302 0.138970 0.194260 0.172392 0.203180 0.191641 0.461502 1.000000 0.507411 0.018728
diabetesMed -0.004537 0.015391 -0.025360 -0.030585 -0.003930 -0.029452 0.000535 0.059464 0.077597 -0.002299 0.030903 -0.009904 0.186247 0.017340 0.029415 0.025559 -0.028985 -0.010210 -0.007452 0.019375 -0.005206 0.086291 0.267566 0.124797 0.205145 0.183024 0.151949 0.140410 0.525169 0.507411 1.000000 0.024819
readm2 0.008626 -0.003421 0.015077 -0.001510 -0.007755 0.051471 0.005589 0.041582 -0.007707 -0.014395 0.018358 -0.013332 0.032318 0.017448 0.053723 0.162676 -0.006205 0.004473 0.012287 0.045801 0.011038 -0.014302 -0.027269 -0.001818 0.003333 -0.004985 -0.005122 -0.009096 0.024743 0.018728 0.024819 1.000000
In [331]:
# heatmap for correlation
plt.figure(figsize=(35,36))
sns.heatmap(df.corr(), annot=True)
Out[331]:
<matplotlib.axes._subplots.AxesSubplot at 0x4c30fc50>

Decision Tree Model Building, Validation, Evaluation

  • Remember the model should be "simple, but not too simple"

Going to Split Data into two Different Sets

  • Training Set
  • Test Set
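
As an illustration of what the 70/30 split does, the sketch below shuffles row positions and cuts them into two groups using only the standard library. The helper split_indices is hypothetical and is not part of scikit-learn; it will not select the same rows as train_test_split, it only shows the idea and the resulting sizes for the 56000 rows in this dataset.

```python
import random

def split_indices(n_rows, test_fraction=0.3, seed=0):
    """Shuffle row positions and split them into train and test lists."""
    positions = list(range(n_rows))
    random.Random(seed).shuffle(positions)   # reproducible shuffle
    n_test = round(n_rows * test_fraction)   # size of the test set
    return positions[n_test:], positions[:n_test]

# 56000 rows, as reported by df.info() above
train_idx, test_idx = split_indices(56000)
print(len(train_idx), len(test_idx))
```

Fixing the seed plays the same role as random_state=0 in the notebook: the same rows land in the test set on every run, so model scores stay comparable.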
In [332]:
# Set Y and X

y = df['readm2']
X = df.drop(['readm2'], axis=1)
X.head(12)
Out[332]:
race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code MED_SPEC_NUM num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient DIAG_CAT_1 DIAG_CAT_2 DIAG_CAT_3 number_diagnoses max_glu_serum A1Cresult metformin glimepiride glipizide glyburide pioglitazone rosiglitazone insulin change diabetesMed
0 3 0 8 0 3 1 4 5 0 0 39 3 11 0 0 0 10 4 18 7 0 0 0 0 0 2 0 0 0 0 1
1 3 0 7 0 5 3 1 6 7 0 79 1 25 3 0 0 16 13 16 9 0 0 2 0 0 2 0 0 3 1 1
2 5 0 6 0 1 22 7 4 14 18 29 2 18 0 0 1 24 18 2 9 0 0 0 0 0 0 0 0 2 0 1
3 3 1 7 0 1 1 7 3 0 18 72 3 18 0 0 0 17 4 3 9 0 0 0 0 0 0 0 0 2 0 1
4 3 0 3 0 2 1 1 3 0 0 21 1 6 0 0 0 23 18 32 9 0 0 0 0 0 0 0 0 0 0 0
5 3 1 7 0 2 1 1 2 0 18 4 0 7 0 0 0 14 9 3 8 0 0 0 0 0 0 0 0 0 0 0
6 3 0 4 0 1 1 7 6 14 33 89 0 25 0 2 1 25 10 16 9 0 0 2 0 0 0 0 0 2 1 1
7 3 0 7 0 1 6 7 4 6 0 63 0 22 0 2 4 16 3 3 5 0 0 2 0 0 0 0 0 0 0 1
8 3 1 6 0 1 1 7 6 7 0 45 0 24 0 0 0 16 9 3 7 0 3 2 0 0 0 2 0 1 1 1
9 3 1 8 0 1 1 7 2 3 0 45 0 13 0 0 0 17 3 3 9 0 0 0 0 2 0 0 0 1 1 1
10 3 1 6 0 2 1 1 3 7 0 57 6 21 0 0 0 12 10 4 9 0 0 0 0 0 0 0 0 2 0 1
11 3 0 4 0 6 1 17 6 0 18 81 0 26 0 0 0 13 21 21 9 0 3 0 0 2 0 2 0 0 1 1
In [333]:
# evaluate the model by splitting into train (70%) and test sets (30%)
# http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

# name the model as "dt"

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)
Out[333]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
In [334]:
#Model evaluation
# http://scikit-learn.org/stable/modules/model_evaluation.html

print(metrics.accuracy_score(y_test, dt.predict(X_test)))
print(metrics.confusion_matrix(y_test, dt.predict(X_test)))
print(metrics.classification_report(y_test, dt.predict(X_test)))
print(metrics.roc_auc_score(y_test, dt.predict(X_test)))

# y_test is the actual y value in the testing dataset
# dt.predict(X_test) is the y value generated by your model
# Where they are the same, the prediction is correct.
0.796428571429
[[13017  1937]
 [ 1483   363]]
             precision    recall  f1-score   support

          0       0.90      0.87      0.88     14954
          1       0.16      0.20      0.18      1846

avg / total       0.82      0.80      0.81     16800

0.533555413199

Question: Interpret the results of the confusion matrix

  • 13017 correctly classified as those who will not be readmitted.
  • 1937 misclassified as those who will be readmitted, but actually will not be readmitted
  • 363 correctly classified as those who will be readmitted
  • 1483 misclassified as those who will not be readmitted, but actually will be readmitted
    - Model accuracy would therefore be calculated as:
       - (13017+363) / (13017+1937+1483+363) = 13380/16800 = 0.7964 ==> Expect roughly 79.64% accuracy when this model is applied to a real-world situation.
    - Recall for the readmitted class is only 363/1846 = 0.20, so the model misses most actual readmissions.
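
The figures in the evaluation output can be recomputed by hand from the confusion matrix printed above. The sketch below does the arithmetic in plain Python; the variable names are illustrative. Because roc_auc_score was given hard 0/1 predictions rather than probabilities, the value it printed equals the balanced accuracy, the mean of recall and specificity.

```python
# Confusion matrix printed above: rows = actual class, columns = predicted class
tn, fp = 13017, 1937   # actual 0 (not readmitted)
fn, tp = 1483, 363     # actual 1 (readmitted)

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
precision = tp / (tp + fp)      # share of predicted readmissions that were real
recall = tp / (tp + fn)         # share of real readmissions that were caught
specificity = tn / (tn + fp)    # share of non-readmissions correctly cleared
balanced_accuracy = (recall + specificity) / 2

print(f"accuracy          = {accuracy:.4f}")
print(f"precision (1)     = {precision:.4f}")
print(f"recall (1)        = {recall:.4f}")
print(f"balanced accuracy = {balanced_accuracy:.4f}")
```

Passing predicted probabilities (for example the second column of dt.predict_proba(X_test)) to roc_auc_score would give an AUC that reflects ranking quality rather than a single threshold.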

Visualizing decision tree

  • There are two methods for this. You can use either method.
  • Using Graphviz software. For this option, you need to have Graphviz installed on your machine.
In [335]:
# Graphviz
tree.export_graphviz(dt, out_file='data/decisiontree.dot', feature_names=X.columns)
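
The export above writes a .dot file, while the next cell displays a .png, so the file has to be rendered with Graphviz in between. A minimal sketch of that step follows, assuming the Graphviz dot executable is on the PATH; the helper dot_to_png_command is hypothetical, and the paths are the ones used in this notebook.

```python
import os
import shutil
import subprocess

def dot_to_png_command(dot_path, png_path):
    """Build the Graphviz command that renders a .dot file to a PNG."""
    return ["dot", "-Tpng", dot_path, "-o", png_path]

cmd = dot_to_png_command("data/decisiontree.dot", "data/decisiontree.png")

# Render only when Graphviz is installed and the .dot file exists
if shutil.which("dot") is not None and os.path.exists("data/decisiontree.dot"):
    subprocess.run(cmd, check=False)
else:
    print("Not rendered; run manually:", " ".join(cmd))
```

The same command with decisiontree_simple.dot and decisiontree_simple.png would render the smaller tree built later.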
In [336]:
# This is a "full-grown" tree 
from IPython.display import Image
Image("data/decisiontree.png")
Out[336]:

Interpreting decision tree

  • Practical Sized Decision Tree
In [337]:
# Set Y and X

y = df['readm2']
X = df.drop(['readm2'], axis=1)
In [338]:
# max_depth=3 limits tree growth ... otherwise you will get a full-grown tree, which is overfitting

# You can make a simpler decision tree

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
dt_simple = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5)
dt_simple.fit(X_train, y_train)

# max_depth : The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.

# min_samples_leaf : The minimum number of samples required to be at a leaf node

# http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier
Out[338]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=5,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
In [339]:
# Find out the performance of this model & interpret the results

print(metrics.accuracy_score(y_test, dt_simple.predict(X_test)))
print(metrics.confusion_matrix(y_test, dt_simple.predict(X_test)))
print(metrics.classification_report(y_test, dt_simple.predict(X_test)))
print(metrics.roc_auc_score(y_test, dt_simple.predict(X_test)))
0.889761904762
[[14946     8]
 [ 1844     2]]
             precision    recall  f1-score   support

          0       0.89      1.00      0.94     14954
          1       0.20      0.00      0.00      1846

avg / total       0.81      0.89      0.84     16800

0.500274224849
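
One way to interpret the 0.89 accuracy above is to compare it with a baseline that always predicts the majority class. The sketch below does that arithmetic in plain Python from the confusion matrix just printed; the baseline comparison is an added illustration, not a step from the original analysis.

```python
# Confusion matrix of dt_simple printed above
tn, fp = 14946, 8     # actual 0 (not readmitted)
fn, tp = 1844, 2      # actual 1 (readmitted)

total = tn + fp + fn + tp
accuracy = (tn + tp) / total

# Baseline: always predict class 0 (not readmitted)
baseline_accuracy = (tn + fp) / total

# Share of actual readmissions the simple tree catches
recall_readmitted = tp / (tp + fn)

print(f"dt_simple accuracy = {accuracy:.4f}")
print(f"baseline accuracy  = {baseline_accuracy:.4f}")
print(f"recall (class 1)   = {recall_readmitted:.4f}")
```

The simple tree predicts "not readmitted" for almost every patient, so its accuracy sits at the level of the baseline while its recall for readmissions is close to zero. With only about 11% of rows in class 1, accuracy alone says little about how well readmissions are identified.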
In [340]:
# Visualize the simpler decision tree model (dt_simple)

tree.export_graphviz(dt_simple, out_file='data/decisiontree_simple.dot', feature_names=X.columns)
In [341]:
# Embed decision tree

from IPython.display import Image
Image("data/decisiontree_simple.png")
Out[341]:

Model Deployment: Predict y values

  • load Challenge_1_Validation_Work.csv (scoring dataset).
  • This dataset has no y value, representing the future.
  • Apply your decision model and find out who is likely to be readmitted.
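
Before the model can be applied, the scoring data must contain exactly the feature columns the tree was trained on. The sketch below compares the two column lists in plain Python. Both lists are copied from the table headers in this notebook (the training X and the scoring file after its ID and raw diagnosis columns are removed); the variable names are illustrative.

```python
# Feature columns the decision tree was trained on (X.columns in this notebook)
model_columns = """race gender age weight admission_type_id
discharge_disposition_id admission_source_id time_in_hospital payer_code
MED_SPEC_NUM num_lab_procedures num_procedures num_medications
number_outpatient number_emergency number_inpatient DIAG_CAT_1 DIAG_CAT_2
DIAG_CAT_3 number_diagnoses max_glu_serum A1Cresult metformin glimepiride
glipizide glyburide pioglitazone rosiglitazone insulin change
diabetesMed""".split()

# Scoring-file columns once the ID and raw diagnosis columns are dropped
score_columns = """race gender age weight admission_type_id
discharge_disposition_id admission_source_id time_in_hospital payer_code
MED_SPEC_NUM num_lab_procedures num_procedures num_medications
number_outpatient number_emergency number_inpatient DIAG_CAT_1 DIAG_CAT_2
DIAG_CAT_3 number_diagnoses max_glu_serum A1Cresult metformin repaglinide
nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide
tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone
tolazamide examide citoglipton insulin glyburide.metformin
glipizide.metformin glimepiride.pioglitazone metformin.rosiglitazone
metformin.pioglitazone change diabetesMed""".split()

# Columns present on one side only
extra = [c for c in score_columns if c not in model_columns]
missing = [c for c in model_columns if c not in score_columns]

print(len(extra), "columns to drop from the scoring data:", extra)
print("columns the model needs but the scoring data lacks:", missing)

# With the DataFrames from this notebook, the aligned frame would be:
# score_X = score[model_columns]
```

Selecting score[model_columns] also puts the columns in the training order, which the fitted tree relies on. The text values in the scoring file still need the same numeric encoding that was used for training.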
In [342]:
#import scoring data

#no Y value in this dataset ... 
#we are trying to predict whether the people in this scoring dataset are likely to be readmitted <30 days or not

score = pd.read_csv('data/Challenge_1_Validation_Work.csv')
score.head(5)
Out[342]:
encounter_id patient_nbr race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code medical_specialty MED_SPEC_NUM num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient diag_1 DIAG_CAT_1 diag_2 DIAG_CAT_2 diag_3 DIAG_CAT_3 number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide examide citoglipton insulin glyburide.metformin glipizide.metformin glimepiride.pioglitazone metformin.rosiglitazone metformin.pioglitazone change diabetesMed
0 166684116 25357527 Caucasian Male [40-50) ? 3 11 1 2 MC Nephrology 19 48 4 11 0 0 0 518 16 431 14 427 12 9 None None No No No No No No No No No No No No No No No No No No No No No No No No No
1 118577202 111860712 Caucasian Female [40-50) ? 1 1 7 2 BC ? 0 31 1 28 0 0 0 996 27 250 3 530 17 5 None None Steady No No No No No No Steady No No Steady No No No No No No Up No No No No No Ch Yes
2 44006898 20621907 Caucasian Male [60-70) ? 6 7 7 1 ? InternalMedicine 18 42 0 12 0 0 0 786 23 V42 32 250 3 3 >200 None No No No No No No No Steady No No No No No No No No No No No No No No No No Yes
3 55615428 110263914 Caucasian Female [70-80) ? 1 1 7 5 ? ? 0 52 2 25 1 1 0 820 24 427 12 428 13 9 None None No Steady No No No No No No No No No No No No No No No Steady No No No No No Ch Yes
4 201098010 96526827 Caucasian Female [40-50) ? 2 1 7 2 SP Surgery-General 55 41 2 3 0 0 0 455 15 535 17 211 2 9 None None No No No No No No No No No No No No No No No No No No No No No No No No No
In [343]:
# drop the columns 'encounter_id', 'patient_nbr' and 'medical_specialty' since they are not used in the analysis, then display the result
score = score.drop('encounter_id', axis=1)
score = score.drop('patient_nbr', axis=1)
score = score.drop('medical_specialty', axis=1)

# drop the columns 'diag_1', 'diag_2' and 'diag_3' since these values have been put into categories
# in columns 'DIAG_CAT_1', 'DIAG_CAT_2' and 'DIAG_CAT_3'
score = score.drop('diag_1', axis=1)
score = score.drop('diag_2', axis=1)
score = score.drop('diag_3', axis=1)

score.head(5)
Out[343]:
race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code MED_SPEC_NUM num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient DIAG_CAT_1 DIAG_CAT_2 DIAG_CAT_3 number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide examide citoglipton insulin glyburide.metformin glipizide.metformin glimepiride.pioglitazone metformin.rosiglitazone metformin.pioglitazone change diabetesMed
0 Caucasian Male [40-50) ? 3 11 1 2 MC 19 48 4 11 0 0 0 16 14 12 9 None None No No No No No No No No No No No No No No No No No No No No No No No No No
1 Caucasian Female [40-50) ? 1 1 7 2 BC 0 31 1 28 0 0 0 27 3 17 5 None None Steady No No No No No No Steady No No Steady No No No No No No Up No No No No No Ch Yes
2 Caucasian Male [60-70) ? 6 7 7 1 ? 18 42 0 12 0 0 0 23 32 3 3 >200 None No No No No No No No Steady No No No No No No No No No No No No No No No No Yes
3 Caucasian Female [70-80) ? 1 1 7 5 ? 0 52 2 25 1 1 0 24 12 13 9 None None No Steady No No No No No No No No No No No No No No No Steady No No No No No Ch Yes
4 Caucasian Female [40-50) ? 2 1 7 2 SP 55 41 2 3 0 0 0 15 17 2 9 None None No No No No No No No No No No No No No No No No No No No No No No No No No
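The six separate drop calls above can also be written as a single call. The following is a minimal sketch on a toy frame (the frame `toy` and its values are illustrative assumptions, not the notebook's data); it passes a list of column names to `drop` with `axis=1`, the same form the notebook already uses.

```python
import pandas as pd

# Toy stand-in for `score`; values are made up for illustration only.
toy = pd.DataFrame({
    'encounter_id': [1, 2],
    'patient_nbr': [10, 20],
    'medical_specialty': ['Nephrology', '?'],
    'diag_1': ['518', '996'],
    'DIAG_CAT_1': [16, 27],
})

# Collect every column to remove in one list and drop them together.
cols_to_drop = ['encounter_id', 'patient_nbr', 'medical_specialty', 'diag_1']
toy = toy.drop(cols_to_drop, axis=1)

print(list(toy.columns))
```

Keeping the names in one list makes it easier to see at a glance which columns leave the frame.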
In [344]:
#replace the values of the 'race' column:
# ? = 0
# AfricanAmerican = 1
# Asian = 2
# Caucasian = 3
# Hispanic = 4
# Other = 5

score = score.replace({'race': {'?': 0, 'AfricanAmerican': 1, 'Asian': 2,'Caucasian': 3,'Hispanic': 4,'Other': 5}})

score.head(2)
Out[344]:
race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code MED_SPEC_NUM num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient DIAG_CAT_1 DIAG_CAT_2 DIAG_CAT_3 number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide examide citoglipton insulin glyburide.metformin glipizide.metformin glimepiride.pioglitazone metformin.rosiglitazone metformin.pioglitazone change diabetesMed
0 3 Male [40-50) ? 3 11 1 2 MC 19 48 4 11 0 0 0 16 14 12 9 None None No No No No No No No No No No No No No No No No No No No No No No No No No
1 3 Female [40-50) ? 1 1 7 2 BC 0 31 1 28 0 0 0 27 3 17 5 None None Steady No No No No No No Steady No No Steady No No No No No No Up No No No No No Ch Yes
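`DataFrame.replace` leaves any value that is not listed in the mapping unchanged, so an unexpected category would stay in the column as a string. The sketch below shows one possible way to check for that, using `Series.map`, which returns NaN for keys missing from the dictionary. The toy frame and the category 'Unknown' are hypothetical and only serve to trigger the check.

```python
import pandas as pd

race_map = {'?': 0, 'AfricanAmerican': 1, 'Asian': 2,
            'Caucasian': 3, 'Hispanic': 4, 'Other': 5}

# Toy data; 'Unknown' is a hypothetical category that is not in race_map.
toy = pd.DataFrame({'race': ['Caucasian', '?', 'Asian', 'Unknown']})

# map() yields NaN wherever the value has no entry in the dictionary.
mapped = toy['race'].map(race_map)
unmapped = sorted(toy.loc[mapped.isnull(), 'race'].unique())

print(unmapped)
```

An empty list means every category in the column is covered by the mapping.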
In [345]:
#replace the values of the 'gender' column:
# Female = 0
# Male = 1

score = score.replace({'gender': {'Male': 1, 'Female': 0}})

score.head(2)
Out[345]:
race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code MED_SPEC_NUM num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient DIAG_CAT_1 DIAG_CAT_2 DIAG_CAT_3 number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide examide citoglipton insulin glyburide.metformin glipizide.metformin glimepiride.pioglitazone metformin.rosiglitazone metformin.pioglitazone change diabetesMed
0 3 1 [40-50) ? 3 11 1 2 MC 19 48 4 11 0 0 0 16 14 12 9 None None No No No No No No No No No No No No No No No No No No No No No No No No No
1 3 0 [40-50) ? 1 1 7 2 BC 0 31 1 28 0 0 0 27 3 17 5 None None Steady No No No No No No Steady No No Steady No No No No No No Up No No No No No Ch Yes
In [346]:
#replace the values of the 'age' column:
# [0-10) = 0
# [10-20) = 1
# [20-30) = 2
# [30-40) = 3
# [40-50) = 4
# [50-60) = 5
# [60-70) = 6
# [70-80) = 7
# [80-90) = 8
# [90-100) = 9

score = score.replace({'age': {'[0-10)': 0, '[10-20)': 1, '[20-30)': 2, '[30-40)': 3, '[40-50)': 4, '[50-60)': 5, '[60-70)': 6, '[70-80)': 7, '[80-90)': 8, '[90-100)': 9}})

score.head(2)
Out[346]:
race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code MED_SPEC_NUM num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient DIAG_CAT_1 DIAG_CAT_2 DIAG_CAT_3 number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide examide citoglipton insulin glyburide.metformin glipizide.metformin glimepiride.pioglitazone metformin.rosiglitazone metformin.pioglitazone change diabetesMed
0 3 1 4 ? 3 11 1 2 MC 19 48 4 11 0 0 0 16 14 12 9 None None No No No No No No No No No No No No No No No No No No No No No No No No No
1 3 0 4 ? 1 1 7 2 BC 0 31 1 28 0 0 0 27 3 17 5 None None Steady No No No No No No Steady No No Steady No No No No No No Up No No No No No Ch Yes
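Because the age brackets follow a regular pattern, the mapping used above can also be generated rather than typed out. This is only a sketch of that idea; the name `age_map` is an assumption and does not appear in the notebook.

```python
# Build the labels '[0-10)', '[10-20)', ..., '[90-100)' and number them 0..9.
age_map = {'[%d-%d)' % (10 * i, 10 * i + 10): i for i in range(10)}

print(age_map['[40-50)'])
print(len(age_map))
```

The resulting dictionary could be passed to `replace` in place of the hand-written one, for example `score.replace({'age': age_map})`.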
In [347]:
#replace the values of the 'weight' column:
# ? = 0
# [0-25) = 1
# [25-50) = 2
# [50-75) = 3
# [75-100) = 4
# [100-125) = 5
# [125-150) = 6
# [150-175) = 7
# [175-200) = 8
# >200 = 9

score = score.replace({'weight': {'?': 0, '[0-25)': 1, '[25-50)': 2, '[50-75)': 3, '[75-100)': 4, '[100-125)': 5, '[125-150)': 6, '[150-175)': 7, '[175-200)': 8, '>200': 9}})

score.head(2)
Out[347]:
race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code MED_SPEC_NUM num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient DIAG_CAT_1 DIAG_CAT_2 DIAG_CAT_3 number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide examide citoglipton insulin glyburide.metformin glipizide.metformin glimepiride.pioglitazone metformin.rosiglitazone metformin.pioglitazone change diabetesMed
0 3 1 4 0 3 11 1 2 MC 19 48 4 11 0 0 0 16 14 12 9 None None No No No No No No No No No No No No No No No No No No No No No No No No No
1 3 0 4 0 1 1 7 2 BC 0 31 1 28 0 0 0 27 3 17 5 None None Steady No No No No No No Steady No No Steady No No No No No No Up No No No No No Ch Yes
In [348]:
#replace the values of the 'payer_code' column:

score = score.replace({'payer_code': {'?': 0, 'BC': 1, 'CH': 2, 'CM': 3, 'CP': 4, 'DM': 5, 'HM': 6, 'MC': 7, 'MD': 8, 'MP': 9, 'OG': 10, 'OT': 11, 'PO': 12, 'SI': 13, 'SP': 14, 'UN': 15, 'WC': 16, 'FR': 19}})

score.head(2)
Out[348]:
race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code MED_SPEC_NUM num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient DIAG_CAT_1 DIAG_CAT_2 DIAG_CAT_3 number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide examide citoglipton insulin glyburide.metformin glipizide.metformin glimepiride.pioglitazone metformin.rosiglitazone metformin.pioglitazone change diabetesMed
0 3 1 4 0 3 11 1 2 7 19 48 4 11 0 0 0 16 14 12 9 None None No No No No No No No No No No No No No No No No No No No No No No No No No
1 3 0 4 0 1 1 7 2 1 0 31 1 28 0 0 0 27 3 17 5 None None Steady No No No No No No Steady No No Steady No No No No No No Up No No No No No Ch Yes
In [349]:
#distribution of payer_code categories in the payer_code column
score.groupby('payer_code').count()
Out[349]:
race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital MED_SPEC_NUM num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient DIAG_CAT_1 DIAG_CAT_2 DIAG_CAT_3 number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide examide citoglipton insulin glyburide.metformin glipizide.metformin glimepiride.pioglitazone metformin.rosiglitazone metformin.pioglitazone change diabetesMed
payer_code
0 5513 5513 5513 5513 5513 5513 5513 5513 5513 5513 5513 5513 5513 5513 5513 5513 5513 5513 5513 5513 5513 5513 5513 5513 5513 5513 5513 5513 5513 5513 5513 5513 5513 5513 5513 5513 5513 5513 5513 5513 5513 5513 5513 5513 5513 5513
1 668 668 668 668 668 668 668 668 668 668 668 668 668 668 668 668 668 668 668 668 668 668 668 668 668 668 668 668 668 668 668 668 668 668 668 668 668 668 668 668 668 668 668 668 668 668
2 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
3 274 274 274 274 274 274 274 274 274 274 274 274 274 274 274 274 274 274 274 274 274 274 274 274 274 274 274 274 274 274 274 274 274 274 274 274 274 274 274 274 274 274 274 274 274 274
4 315 315 315 315 315 315 315 315 315 315 315 315 315 315 315 315 315 315 315 315 315 315 315 315 315 315 315 315 315 315 315 315 315 315 315 315 315 315 315 315 315 315 315 315 315 315
5 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69
6 906 906 906 906 906 906 906 906 906 906 906 906 906 906 906 906 906 906 906 906 906 906 906 906 906 906 906 906 906 906 906 906 906 906 906 906 906 906 906 906 906 906 906 906 906 906
7 4437 4437 4437 4437 4437 4437 4437 4437 4437 4437 4437 4437 4437 4437 4437 4437 4437 4437 4437 4437 4437 4437 4437 4437 4437 4437 4437 4437 4437 4437 4437 4437 4437 4437 4437 4437 4437 4437 4437 4437 4437 4437 4437 4437 4437 4437
8 484 484 484 484 484 484 484 484 484 484 484 484 484 484 484 484 484 484 484 484 484 484 484 484 484 484 484 484 484 484 484 484 484 484 484 484 484 484 484 484 484 484 484 484 484 484
9 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14
10 142 142 142 142 142 142 142 142 142 142 142 142 142 142 142 142 142 142 142 142 142 142 142 142 142 142 142 142 142 142 142 142 142 142 142 142 142 142 142 142 142 142 142 142 142 142
11 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15
12 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85
13 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
14 669 669 669 669 669 669 669 669 669 669 669 669 669 669 669 669 669 669 669 669 669 669 669 669 669 669 669 669 669 669 669 669 669 669 669 669 669 669 669 669 669 669 669 669 669 669
15 354 354 354 354 354 354 354 354 354 354 354 354 354 354 354 354 354 354 354 354 354 354 354 354 354 354 354 354 354 354 354 354 354 354 354 354 354 354 354 354 354 354 354 354 354 354
16 28 28 28 28 28 28 28 28 28 28 28 28 28 28 28 28 28 28 28 28 28 28 28 28 28 28 28 28 28 28 28 28 28 28 28 28 28 28 28 28 28 28 28 28 28 28
19 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
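Since no column has missing values, `groupby('payer_code').count()` repeats the same count in every column. A more compact view of the same distribution is sketched below with `value_counts()` on a toy frame (the data are illustrative assumptions).

```python
import pandas as pd

# Toy stand-in for `score`; values are made up for illustration only.
toy = pd.DataFrame({'payer_code': [0, 7, 0, 1, 7, 0],
                    'age': [4, 4, 6, 7, 4, 8]})

# One count per payer code, ordered by code rather than by frequency.
counts = toy['payer_code'].value_counts().sort_index()

print(counts)
```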
In [350]:
score.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14000 entries, 0 to 13999
Data columns (total 47 columns):
race                        14000 non-null int64
gender                      14000 non-null int64
age                         14000 non-null int64
weight                      14000 non-null int64
admission_type_id           14000 non-null int64
discharge_disposition_id    14000 non-null int64
admission_source_id         14000 non-null int64
time_in_hospital            14000 non-null int64
payer_code                  14000 non-null int64
MED_SPEC_NUM                14000 non-null int64
num_lab_procedures          14000 non-null int64
num_procedures              14000 non-null int64
num_medications             14000 non-null int64
number_outpatient           14000 non-null int64
number_emergency            14000 non-null int64
number_inpatient            14000 non-null int64
DIAG_CAT_1                  14000 non-null int64
DIAG_CAT_2                  14000 non-null int64
DIAG_CAT_3                  14000 non-null int64
number_diagnoses            14000 non-null int64
max_glu_serum               14000 non-null object
A1Cresult                   14000 non-null object
metformin                   14000 non-null object
repaglinide                 14000 non-null object
nateglinide                 14000 non-null object
chlorpropamide              14000 non-null object
glimepiride                 14000 non-null object
acetohexamide               14000 non-null object
glipizide                   14000 non-null object
glyburide                   14000 non-null object
tolbutamide                 14000 non-null object
pioglitazone                14000 non-null object
rosiglitazone               14000 non-null object
acarbose                    14000 non-null object
miglitol                    14000 non-null object
troglitazone                14000 non-null object
tolazamide                  14000 non-null object
examide                     14000 non-null object
citoglipton                 14000 non-null object
insulin                     14000 non-null object
glyburide.metformin         14000 non-null object
glipizide.metformin         14000 non-null object
glimepiride.pioglitazone    14000 non-null object
metformin.rosiglitazone     14000 non-null object
metformin.pioglitazone      14000 non-null object
change                      14000 non-null object
diabetesMed                 14000 non-null object
dtypes: int64(20), object(27)
memory usage: 5.0+ MB
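The `info()` output above shows 27 columns that still hold strings. One possible way to list such columns programmatically is sketched below with `pd.api.types.is_numeric_dtype`; the toy frame is an illustrative assumption.

```python
import pandas as pd

# Toy stand-in for `score`: one encoded column and two that still hold strings.
toy = pd.DataFrame({'race': [3, 3],
                    'max_glu_serum': ['None', '>200'],
                    'insulin': ['No', 'Up']})

# Columns whose dtype is not numeric still need to be encoded.
remaining = [c for c in toy.columns
             if not pd.api.types.is_numeric_dtype(toy[c])]

print(remaining)
```

When this list is empty, every column has been converted to numbers.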
In [351]:
#replace the values of the 'max_glu_serum' column:
# None = 0
# Norm = 1
# >200 = 2
# >300 = 3

score = score.replace({'max_glu_serum': {'None': 0, 'Norm': 1, '>200': 2, '>300': 3}})

score.head(2)
Out[351]:
race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code MED_SPEC_NUM num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient DIAG_CAT_1 DIAG_CAT_2 DIAG_CAT_3 number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide examide citoglipton insulin glyburide.metformin glipizide.metformin glimepiride.pioglitazone metformin.rosiglitazone metformin.pioglitazone change diabetesMed
0 3 1 4 0 3 11 1 2 7 19 48 4 11 0 0 0 16 14 12 9 0 None No No No No No No No No No No No No No No No No No No No No No No No No No
1 3 0 4 0 1 1 7 2 1 0 31 1 28 0 0 0 27 3 17 5 0 None Steady No No No No No No Steady No No Steady No No No No No No Up No No No No No Ch Yes
In [352]:
#replace the values of the 'A1Cresult' column:
# None = 0
# Norm = 1
# >7 = 2
# >8 = 3

score = score.replace({'A1Cresult': {'None': 0, 'Norm': 1, '>7': 2, '>8': 3}})

score.head(2)
Out[352]:
race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code MED_SPEC_NUM num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient DIAG_CAT_1 DIAG_CAT_2 DIAG_CAT_3 number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide examide citoglipton insulin glyburide.metformin glipizide.metformin glimepiride.pioglitazone metformin.rosiglitazone metformin.pioglitazone change diabetesMed
0 3 1 4 0 3 11 1 2 7 19 48 4 11 0 0 0 16 14 12 9 0 0 No No No No No No No No No No No No No No No No No No No No No No No No No
1 3 0 4 0 1 1 7 2 1 0 31 1 28 0 0 0 27 3 17 5 0 0 Steady No No No No No No Steady No No Steady No No No No No No Up No No No No No Ch Yes
In [353]:
#replace the values of the 'change' column:
# No = 0
# Ch = 1

score = score.replace({'change': {'No': 0, 'Ch': 1}})

score.head(2)
Out[353]:
race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code MED_SPEC_NUM num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient DIAG_CAT_1 DIAG_CAT_2 DIAG_CAT_3 number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide examide citoglipton insulin glyburide.metformin glipizide.metformin glimepiride.pioglitazone metformin.rosiglitazone metformin.pioglitazone change diabetesMed
0 3 1 4 0 3 11 1 2 7 19 48 4 11 0 0 0 16 14 12 9 0 0 No No No No No No No No No No No No No No No No No No No No No No No 0 No
1 3 0 4 0 1 1 7 2 1 0 31 1 28 0 0 0 27 3 17 5 0 0 Steady No No No No No No Steady No No Steady No No No No No No Up No No No No No 1 Yes
In [354]:
#replace the values of the 'diabetesMed' column:
# No = 0
# Yes = 1

score = score.replace({'diabetesMed': {'No': 0, 'Yes': 1}})

score.head(2)
Out[354]:
race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code MED_SPEC_NUM num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient DIAG_CAT_1 DIAG_CAT_2 DIAG_CAT_3 number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide examide citoglipton insulin glyburide.metformin glipizide.metformin glimepiride.pioglitazone metformin.rosiglitazone metformin.pioglitazone change diabetesMed
0 3 1 4 0 3 11 1 2 7 19 48 4 11 0 0 0 16 14 12 9 0 0 No No No No No No No No No No No No No No No No No No No No No No No 0 0
1 3 0 4 0 1 1 7 2 1 0 31 1 28 0 0 0 27 3 17 5 0 0 Steady No No No No No No Steady No No Steady No No No No No No Up No No No No No 1 1
In [355]:
#replace the values in the medication columns:
# No = 0
# Down = 1
# Steady = 2
# Up = 3

med_cols = ['metformin', 'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride',
            'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide', 'pioglitazone',
            'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone', 'tolazamide',
            'examide', 'citoglipton', 'insulin', 'glyburide.metformin',
            'glipizide.metformin', 'glimepiride.pioglitazone',
            'metformin.rosiglitazone', 'metformin.pioglitazone']
med_map = {'No': 0, 'Down': 1, 'Steady': 2, 'Up': 3}

# apply the same mapping to every medication column in a single call
score = score.replace({col: med_map for col in med_cols})

score.head(2)
Out[355]:
race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code MED_SPEC_NUM num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient DIAG_CAT_1 DIAG_CAT_2 DIAG_CAT_3 number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide examide citoglipton insulin glyburide.metformin glipizide.metformin glimepiride.pioglitazone metformin.rosiglitazone metformin.pioglitazone change diabetesMed
0 3 1 4 0 3 11 1 2 7 19 48 4 11 0 0 0 16 14 12 9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 3 0 4 0 1 1 7 2 1 0 31 1 28 0 0 0 27 3 17 5 0 0 2 0 0 0 0 0 0 2 0 0 2 0 0 0 0 0 0 3 0 0 0 0 0 1 1
In [356]:
# keep the fully numeric data frame under a new name; copy() makes it independent of later changes to score
score_clean_NoString = score.copy()
In [357]:
#info
score_clean_NoString.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14000 entries, 0 to 13999
Data columns (total 47 columns):
race                        14000 non-null int64
gender                      14000 non-null int64
age                         14000 non-null int64
weight                      14000 non-null int64
admission_type_id           14000 non-null int64
discharge_disposition_id    14000 non-null int64
admission_source_id         14000 non-null int64
time_in_hospital            14000 non-null int64
payer_code                  14000 non-null int64
MED_SPEC_NUM                14000 non-null int64
num_lab_procedures          14000 non-null int64
num_procedures              14000 non-null int64
num_medications             14000 non-null int64
number_outpatient           14000 non-null int64
number_emergency            14000 non-null int64
number_inpatient            14000 non-null int64
DIAG_CAT_1                  14000 non-null int64
DIAG_CAT_2                  14000 non-null int64
DIAG_CAT_3                  14000 non-null int64
number_diagnoses            14000 non-null int64
max_glu_serum               14000 non-null int64
A1Cresult                   14000 non-null int64
metformin                   14000 non-null int64
repaglinide                 14000 non-null int64
nateglinide                 14000 non-null int64
chlorpropamide              14000 non-null int64
glimepiride                 14000 non-null int64
acetohexamide               14000 non-null int64
glipizide                   14000 non-null int64
glyburide                   14000 non-null int64
tolbutamide                 14000 non-null int64
pioglitazone                14000 non-null int64
rosiglitazone               14000 non-null int64
acarbose                    14000 non-null int64
miglitol                    14000 non-null int64
troglitazone                14000 non-null int64
tolazamide                  14000 non-null int64
examide                     14000 non-null int64
citoglipton                 14000 non-null int64
insulin                     14000 non-null int64
glyburide.metformin         14000 non-null int64
glipizide.metformin         14000 non-null int64
glimepiride.pioglitazone    14000 non-null int64
metformin.rosiglitazone     14000 non-null int64
metformin.pioglitazone      14000 non-null int64
change                      14000 non-null int64
diabetesMed                 14000 non-null int64
dtypes: int64(47)
memory usage: 5.0 MB
In [358]:
# write dataframe with no string values to new csv file
score_clean_NoString.to_csv('data/Challenge_1_Validation_Work_Clean_NoString.csv')
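By default `to_csv` also writes the row index, which comes back as an extra column named 'Unnamed: 0' when the file is read again. The sketch below compares the default with `index=False` on a toy frame; the temporary path and the frame are assumptions used only for illustration.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the cleaned frame; values are illustrative only.
toy = pd.DataFrame({'race': [3, 1], 'gender': [1, 0]})
path = os.path.join(tempfile.mkdtemp(), 'toy_clean.csv')  # hypothetical path

# Default: the index is written as an unnamed first column.
toy.to_csv(path)
with_index = pd.read_csv(path)

# index=False: only the data columns are written.
toy.to_csv(path, index=False)
without_index = pd.read_csv(path)

print(list(with_index.columns))
print(list(without_index.columns))
```

Whether to keep the index depends on how the file is read later; passing `index_col=0` to `read_csv` is another way to avoid the extra column.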
In [359]:
#score = pd.read_csv('data/Challenge_1_Validation_Work.csv')

Random Forest

In [360]:
score.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14000 entries, 0 to 13999
Data columns (total 47 columns):
race                        14000 non-null int64
gender                      14000 non-null int64
age                         14000 non-null int64
weight                      14000 non-null int64
admission_type_id           14000 non-null int64
discharge_disposition_id    14000 non-null int64
admission_source_id         14000 non-null int64
time_in_hospital            14000 non-null int64
payer_code                  14000 non-null int64
MED_SPEC_NUM                14000 non-null int64
num_lab_procedures          14000 non-null int64
num_procedures              14000 non-null int64
num_medications             14000 non-null int64
number_outpatient           14000 non-null int64
number_emergency            14000 non-null int64
number_inpatient            14000 non-null int64
DIAG_CAT_1                  14000 non-null int64
DIAG_CAT_2                  14000 non-null int64
DIAG_CAT_3                  14000 non-null int64
number_diagnoses            14000 non-null int64
max_glu_serum               14000 non-null int64
A1Cresult                   14000 non-null int64
metformin                   14000 non-null int64
repaglinide                 14000 non-null int64
nateglinide                 14000 non-null int64
chlorpropamide              14000 non-null int64
glimepiride                 14000 non-null int64
acetohexamide               14000 non-null int64
glipizide                   14000 non-null int64
glyburide                   14000 non-null int64
tolbutamide                 14000 non-null int64
pioglitazone                14000 non-null int64
rosiglitazone               14000 non-null int64
acarbose                    14000 non-null int64
miglitol                    14000 non-null int64
troglitazone                14000 non-null int64
tolazamide                  14000 non-null int64
examide                     14000 non-null int64
citoglipton                 14000 non-null int64
insulin                     14000 non-null int64
glyburide.metformin         14000 non-null int64
glipizide.metformin         14000 non-null int64
glimepiride.pioglitazone    14000 non-null int64
metformin.rosiglitazone     14000 non-null int64
metformin.pioglitazone      14000 non-null int64
change                      14000 non-null int64
diabetesMed                 14000 non-null int64
dtypes: int64(47)
memory usage: 5.0 MB
In [361]:
# Set Y and X

y = df['readm2']
X = df.drop(['readm2'], axis=1)
In [362]:
X.head(12)
Out[362]:
race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code MED_SPEC_NUM num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient DIAG_CAT_1 DIAG_CAT_2 DIAG_CAT_3 number_diagnoses max_glu_serum A1Cresult metformin glimepiride glipizide glyburide pioglitazone rosiglitazone insulin change diabetesMed
0 3 0 8 0 3 1 4 5 0 0 39 3 11 0 0 0 10 4 18 7 0 0 0 0 0 2 0 0 0 0 1
1 3 0 7 0 5 3 1 6 7 0 79 1 25 3 0 0 16 13 16 9 0 0 2 0 0 2 0 0 3 1 1
2 5 0 6 0 1 22 7 4 14 18 29 2 18 0 0 1 24 18 2 9 0 0 0 0 0 0 0 0 2 0 1
3 3 1 7 0 1 1 7 3 0 18 72 3 18 0 0 0 17 4 3 9 0 0 0 0 0 0 0 0 2 0 1
4 3 0 3 0 2 1 1 3 0 0 21 1 6 0 0 0 23 18 32 9 0 0 0 0 0 0 0 0 0 0 0
5 3 1 7 0 2 1 1 2 0 18 4 0 7 0 0 0 14 9 3 8 0 0 0 0 0 0 0 0 0 0 0
6 3 0 4 0 1 1 7 6 14 33 89 0 25 0 2 1 25 10 16 9 0 0 2 0 0 0 0 0 2 1 1
7 3 0 7 0 1 6 7 4 6 0 63 0 22 0 2 4 16 3 3 5 0 0 2 0 0 0 0 0 0 0 1
8 3 1 6 0 1 1 7 6 7 0 45 0 24 0 0 0 16 9 3 7 0 3 2 0 0 0 2 0 1 1 1
9 3 1 8 0 1 1 7 2 3 0 45 0 13 0 0 0 17 3 3 9 0 0 0 0 2 0 0 0 1 1 1
10 3 1 6 0 2 1 1 3 7 0 57 6 21 0 0 0 12 10 4 9 0 0 0 0 0 0 0 0 2 0 1
11 3 0 4 0 6 1 17 6 0 18 81 0 26 0 0 0 13 21 21 9 0 3 0 0 2 0 2 0 0 1 1
In [363]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 56000 entries, 0 to 55999
Data columns (total 32 columns):
race                        56000 non-null int64
gender                      56000 non-null int64
age                         56000 non-null int64
weight                      56000 non-null int64
admission_type_id           56000 non-null int64
discharge_disposition_id    56000 non-null int64
admission_source_id         56000 non-null int64
time_in_hospital            56000 non-null int64
payer_code                  56000 non-null int64
MED_SPEC_NUM                56000 non-null int64
num_lab_procedures          56000 non-null int64
num_procedures              56000 non-null int64
num_medications             56000 non-null int64
number_outpatient           56000 non-null int64
number_emergency            56000 non-null int64
number_inpatient            56000 non-null int64
DIAG_CAT_1                  56000 non-null int64
DIAG_CAT_2                  56000 non-null int64
DIAG_CAT_3                  56000 non-null int64
number_diagnoses            56000 non-null int64
max_glu_serum               56000 non-null int64
A1Cresult                   56000 non-null int64
metformin                   56000 non-null int64
glimepiride                 56000 non-null int64
glipizide                   56000 non-null int64
glyburide                   56000 non-null int64
pioglitazone                56000 non-null int64
rosiglitazone               56000 non-null int64
insulin                     56000 non-null int64
change                      56000 non-null int64
diabetesMed                 56000 non-null int64
readm2                      56000 non-null int64
dtypes: int64(32)
memory usage: 13.7 MB
In [364]:
score.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14000 entries, 0 to 13999
Data columns (total 47 columns):
race                        14000 non-null int64
gender                      14000 non-null int64
age                         14000 non-null int64
weight                      14000 non-null int64
admission_type_id           14000 non-null int64
discharge_disposition_id    14000 non-null int64
admission_source_id         14000 non-null int64
time_in_hospital            14000 non-null int64
payer_code                  14000 non-null int64
MED_SPEC_NUM                14000 non-null int64
num_lab_procedures          14000 non-null int64
num_procedures              14000 non-null int64
num_medications             14000 non-null int64
number_outpatient           14000 non-null int64
number_emergency            14000 non-null int64
number_inpatient            14000 non-null int64
DIAG_CAT_1                  14000 non-null int64
DIAG_CAT_2                  14000 non-null int64
DIAG_CAT_3                  14000 non-null int64
number_diagnoses            14000 non-null int64
max_glu_serum               14000 non-null int64
A1Cresult                   14000 non-null int64
metformin                   14000 non-null int64
repaglinide                 14000 non-null int64
nateglinide                 14000 non-null int64
chlorpropamide              14000 non-null int64
glimepiride                 14000 non-null int64
acetohexamide               14000 non-null int64
glipizide                   14000 non-null int64
glyburide                   14000 non-null int64
tolbutamide                 14000 non-null int64
pioglitazone                14000 non-null int64
rosiglitazone               14000 non-null int64
acarbose                    14000 non-null int64
miglitol                    14000 non-null int64
troglitazone                14000 non-null int64
tolazamide                  14000 non-null int64
examide                     14000 non-null int64
citoglipton                 14000 non-null int64
insulin                     14000 non-null int64
glyburide.metformin         14000 non-null int64
glipizide.metformin         14000 non-null int64
glimepiride.pioglitazone    14000 non-null int64
metformin.rosiglitazone     14000 non-null int64
metformin.pioglitazone      14000 non-null int64
change                      14000 non-null int64
diabetesMed                 14000 non-null int64
dtypes: int64(47)
memory usage: 5.0 MB
In [365]:
# Drop medication columns that are not used in any of the cases
score = score.drop(['examide', 'citoglipton', 'glimepiride.pioglitazone'], axis=1)

# Drop the remaining medication columns that are not model features,
# so the scoring data keeps the same 31 columns the model was trained on
score = score.drop(['acetohexamide', 'metformin.pioglitazone', 'metformin.rosiglitazone',
                    'tolazamide', 'tolbutamide', 'troglitazone', 'chlorpropamide',
                    'glipizide.metformin', 'miglitol', 'acarbose', 'glyburide.metformin',
                    'nateglinide', 'repaglinide'], axis=1)

# Display the first 12 rows of the result
score.head(12)
Out[365]:
race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code MED_SPEC_NUM num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient DIAG_CAT_1 DIAG_CAT_2 DIAG_CAT_3 number_diagnoses max_glu_serum A1Cresult metformin glimepiride glipizide glyburide pioglitazone rosiglitazone insulin change diabetesMed
0 3 1 4 0 3 11 1 2 7 19 48 4 11 0 0 0 16 14 12 9 0 0 0 0 0 0 0 0 0 0 0
1 3 0 4 0 1 1 7 2 1 0 31 1 28 0 0 0 27 3 17 5 0 0 2 0 0 2 0 2 3 1 1
2 3 1 6 0 6 7 7 1 0 18 42 0 12 0 0 0 23 32 3 3 2 0 0 0 0 2 0 0 0 0 1
3 3 0 7 0 1 1 7 5 0 0 52 2 25 1 1 0 24 12 13 9 0 0 0 0 0 0 0 0 2 1 1
4 3 0 4 0 2 1 7 2 14 55 41 2 3 0 0 0 15 17 2 9 0 0 0 0 0 0 0 0 0 0 0
5 3 1 4 0 5 1 1 5 6 0 1 2 25 1 0 0 32 9 18 5 0 0 2 0 0 2 0 2 2 1 1
6 1 0 8 0 1 3 7 1 7 0 58 1 11 1 0 1 23 13 23 7 0 0 0 0 0 0 0 0 2 0 1
7 3 1 5 0 1 6 7 5 7 0 54 0 13 0 0 0 6 17 26 8 0 0 0 0 0 0 0 0 2 0 1
8 3 0 8 0 2 3 1 3 0 0 48 0 10 0 0 0 17 3 18 7 0 1 0 0 0 2 0 0 2 1 1
9 3 0 8 0 1 3 7 4 1 0 41 0 14 0 0 1 24 18 3 9 0 0 0 0 1 0 0 0 0 1 1
10 3 1 2 0 2 1 2 10 0 0 53 0 20 0 0 0 3 3 3 6 0 0 0 0 0 0 0 0 1 1 1
11 3 1 4 0 2 6 4 6 0 4 48 2 11 0 0 0 10 10 10 9 0 0 0 0 0 0 0 0 2 0 1
In [366]:
score.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14000 entries, 0 to 13999
Data columns (total 31 columns):
race                        14000 non-null int64
gender                      14000 non-null int64
age                         14000 non-null int64
weight                      14000 non-null int64
admission_type_id           14000 non-null int64
discharge_disposition_id    14000 non-null int64
admission_source_id         14000 non-null int64
time_in_hospital            14000 non-null int64
payer_code                  14000 non-null int64
MED_SPEC_NUM                14000 non-null int64
num_lab_procedures          14000 non-null int64
num_procedures              14000 non-null int64
num_medications             14000 non-null int64
number_outpatient           14000 non-null int64
number_emergency            14000 non-null int64
number_inpatient            14000 non-null int64
DIAG_CAT_1                  14000 non-null int64
DIAG_CAT_2                  14000 non-null int64
DIAG_CAT_3                  14000 non-null int64
number_diagnoses            14000 non-null int64
max_glu_serum               14000 non-null int64
A1Cresult                   14000 non-null int64
metformin                   14000 non-null int64
glimepiride                 14000 non-null int64
glipizide                   14000 non-null int64
glyburide                   14000 non-null int64
pioglitazone                14000 non-null int64
rosiglitazone               14000 non-null int64
insulin                     14000 non-null int64
change                      14000 non-null int64
diabetesMed                 14000 non-null int64
dtypes: int64(31)
memory usage: 3.3 MB
In [367]:
# develop a random forest model
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=20)    #building 20 decision trees
clf=clf.fit(X, y)
clf.score(X,y)
Out[367]:
0.99219642857142853

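Random Forest training accuracy = 99.22% (scored on the same X and y the model was fit on, so it likely overstates accuracy on new patients)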

In [368]:
# generate evaluation metrics on the training data (X, y)
print(metrics.accuracy_score(y, clf.predict(X))) #overall accuracy
print(metrics.confusion_matrix(y, clf.predict(X)))
print(metrics.classification_report(y, clf.predict(X)))
0.992196428571
[[49715     0]
 [  437  5848]]
             precision    recall  f1-score   support

          0       0.99      1.00      1.00     49715
          1       1.00      0.93      0.96      6285

avg / total       0.99      0.99      0.99     56000
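
The metrics above are computed on the rows the forest was trained on. A minimal sketch of the same evaluation on a held-out split follows. It assumes a recent scikit-learn in which train_test_split lives in sklearn.model_selection (this notebook imports it from the older sklearn.cross_validation), and it uses a synthetic, imbalanced dataset from make_classification as a stand-in, so X_demo, y_demo and the printed numbers are illustrative only and are not project results.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 31 features, roughly 11% positive class.
# This is NOT the readmission dataset used in this notebook.
X_demo, y_demo = make_classification(n_samples=2000, n_features=31,
                                     n_informative=8, weights=[0.89, 0.11],
                                     random_state=0)

# Hold out 30% of the rows; stratify keeps the class ratio in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0)

rf = RandomForestClassifier(n_estimators=20, random_state=0)
rf.fit(X_train, y_train)

# Compare accuracy on rows the model has seen with rows it has not seen
train_acc = metrics.accuracy_score(y_train, rf.predict(X_train))
test_pred = rf.predict(X_test)
test_acc = metrics.accuracy_score(y_test, test_pred)

print("training accuracy:", train_acc)
print("held-out accuracy:", test_acc)
print(metrics.confusion_matrix(y_test, test_pred))
print(metrics.classification_report(y_test, test_pred))
```

With the project data, X and y would take the place of X_demo and y_demo. The held-out recall for class 1 is usually the more informative number here, because most patients are not readmitted.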

In [369]:
print("Features sorted by their rank:")
print(sorted(zip(map(lambda x: round(x, 4), clf.feature_importances_), X.columns)))
Features sorted by their rank:
[(0.0058999999999999999, 'weight'), (0.0071000000000000004, 'max_glu_serum'), (0.0077000000000000002, 'rosiglitazone'), (0.0080999999999999996, 'glimepiride'), (0.0091999999999999998, 'diabetesMed'), (0.0094999999999999998, 'pioglitazone'), (0.0109, 'glyburide'), (0.011900000000000001, 'metformin'), (0.0134, 'glipizide'), (0.0143, 'change'), (0.017000000000000001, 'A1Cresult'), (0.017000000000000001, 'number_emergency'), (0.019, 'gender'), (0.021700000000000001, 'number_outpatient'), (0.023400000000000001, 'race'), (0.023599999999999999, 'admission_source_id'), (0.027099999999999999, 'admission_type_id'), (0.028400000000000002, 'insulin'), (0.035000000000000003, 'number_diagnoses'), (0.0361, 'num_procedures'), (0.037600000000000001, 'discharge_disposition_id'), (0.038600000000000002, 'payer_code'), (0.044400000000000002, 'number_inpatient'), (0.044699999999999997, 'MED_SPEC_NUM'), (0.048599999999999997, 'age'), (0.0591, 'time_in_hospital'), (0.065199999999999994, 'DIAG_CAT_3'), (0.0659, 'DIAG_CAT_1'), (0.070900000000000005, 'DIAG_CAT_2'), (0.081199999999999994, 'num_medications'), (0.097600000000000006, 'num_lab_procedures')]
In [370]:
# another method: show each feature's importance in a DataFrame
pd.DataFrame({'feature':X.columns, 'importance':clf.feature_importances_})
Out[370]:
feature importance
0 race 0.023426
1 gender 0.018953
2 age 0.048555
3 weight 0.005906
4 admission_type_id 0.027056
5 discharge_disposition_id 0.037582
6 admission_source_id 0.023568
7 time_in_hospital 0.059076
8 payer_code 0.038612
9 MED_SPEC_NUM 0.044668
10 num_lab_procedures 0.097644
11 num_procedures 0.036111
12 num_medications 0.081217
13 number_outpatient 0.021676
14 number_emergency 0.017004
15 number_inpatient 0.044374
16 DIAG_CAT_1 0.065932
17 DIAG_CAT_2 0.070878
18 DIAG_CAT_3 0.065177
19 number_diagnoses 0.035034
20 max_glu_serum 0.007139
21 A1Cresult 0.017001
22 metformin 0.011851
23 glimepiride 0.008142
24 glipizide 0.013425
25 glyburide 0.010854
26 pioglitazone 0.009453
27 rosiglitazone 0.007738
28 insulin 0.028408
29 change 0.014314
30 diabetesMed 0.009225
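
The table above lists features in column order. A small sketch of sorting the same kind of table so the most important features come first is shown below. It uses a synthetic dataset with made-up column names (f0 to f5), so the names and values are assumptions for illustration, not results from this project.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical stand-in data with illustrative column names
X_arr, y_demo = make_classification(n_samples=500, n_features=6,
                                    n_informative=3, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=["f%d" % i for i in range(6)])

rf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_demo, y_demo)

# Pair each column with its importance, then sort from largest to smallest
importance = (
    pd.DataFrame({"feature": X_demo.columns,
                  "importance": rf.feature_importances_})
    .sort_values("importance", ascending=False)
    .reset_index(drop=True)
)
print(importance)
```

In this notebook the same chain could be applied to X.columns and clf.feature_importances_. The importances of a fitted forest sum to 1, so each value can be read as a share of the total.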
In [371]:
#Predict class probabilities for X
clf.predict_proba(X)
Out[371]:
array([[ 0.95,  0.05],
       [ 1.  ,  0.  ],
       [ 0.95,  0.05],
       ..., 
       [ 0.95,  0.05],
       [ 0.95,  0.05],
       [ 0.8 ,  0.2 ]])

Make predictions on the new dataset (scoring dataset without y value)

In [372]:
score.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14000 entries, 0 to 13999
Data columns (total 31 columns):
race                        14000 non-null int64
gender                      14000 non-null int64
age                         14000 non-null int64
weight                      14000 non-null int64
admission_type_id           14000 non-null int64
discharge_disposition_id    14000 non-null int64
admission_source_id         14000 non-null int64
time_in_hospital            14000 non-null int64
payer_code                  14000 non-null int64
MED_SPEC_NUM                14000 non-null int64
num_lab_procedures          14000 non-null int64
num_procedures              14000 non-null int64
num_medications             14000 non-null int64
number_outpatient           14000 non-null int64
number_emergency            14000 non-null int64
number_inpatient            14000 non-null int64
DIAG_CAT_1                  14000 non-null int64
DIAG_CAT_2                  14000 non-null int64
DIAG_CAT_3                  14000 non-null int64
number_diagnoses            14000 non-null int64
max_glu_serum               14000 non-null int64
A1Cresult                   14000 non-null int64
metformin                   14000 non-null int64
glimepiride                 14000 non-null int64
glipizide                   14000 non-null int64
glyburide                   14000 non-null int64
pioglitazone                14000 non-null int64
rosiglitazone               14000 non-null int64
insulin                     14000 non-null int64
change                      14000 non-null int64
diabetesMed                 14000 non-null int64
dtypes: int64(31)
memory usage: 3.3 MB
In [373]:
score.head()
Out[373]:
race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code MED_SPEC_NUM num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient DIAG_CAT_1 DIAG_CAT_2 DIAG_CAT_3 number_diagnoses max_glu_serum A1Cresult metformin glimepiride glipizide glyburide pioglitazone rosiglitazone insulin change diabetesMed
0 3 1 4 0 3 11 1 2 7 19 48 4 11 0 0 0 16 14 12 9 0 0 0 0 0 0 0 0 0 0 0
1 3 0 4 0 1 1 7 2 1 0 31 1 28 0 0 0 27 3 17 5 0 0 2 0 0 2 0 2 3 1 1
2 3 1 6 0 6 7 7 1 0 18 42 0 12 0 0 0 23 32 3 3 2 0 0 0 0 2 0 0 0 0 1
3 3 0 7 0 1 1 7 5 0 0 52 2 25 1 1 0 24 12 13 9 0 0 0 0 0 0 0 0 2 1 1
4 3 0 4 0 2 1 7 2 14 55 41 2 3 0 0 0 15 17 2 9 0 0 0 0 0 0 0 0 0 0 0
In [374]:
#score=pd.read_csv("data/Challenge_1_Validation_Work_Clean_NoString.csv")
output_scoring = clf.predict(score)
predicted_y= pd.DataFrame(output_scoring, columns=['Predicted_Readmit_30'])

probs = clf.predict_proba(score)
probs = pd.DataFrame(probs, columns=['Prob of NO', 'Prob of YES'])

readmit_patients = predicted_y.join(probs)

readmit_patients.to_csv("data/output_readmit_RandomForest_ScoringDataset.csv")

readmit_patients.head()
Out[374]:
Predicted_Readmit_30 Prob of NO Prob of YES
0 0 1.00 0.00
1 0 0.95 0.05
2 0 1.00 0.00
3 0 0.95 0.05
4 0 0.70 0.30
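
Scoring only works when the new data has the same columns, in the same order, as the data the model was fit on. A hedged sketch of checking and aligning the columns before predicting follows. The frames X_demo and score_demo and the column names a to d and unused are hypothetical, and the probability columns are labelled from rf.classes_ rather than hard-coded.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical training frame with illustrative column names
cols = ["a", "b", "c", "d"]
X_arr, y_demo = make_classification(n_samples=300, n_features=4,
                                    n_informative=2, n_redundant=0,
                                    random_state=0)
X_demo = pd.DataFrame(X_arr, columns=cols)
rf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_demo, y_demo)

# Hypothetical scoring frame: shuffled column order plus one extra column
score_demo = X_demo.sample(5, random_state=1)[["d", "c", "b", "a"]]
score_demo = score_demo.reset_index(drop=True)
score_demo["unused"] = 0

# Fail early if a training feature is missing from the scoring data
missing = set(X_demo.columns) - set(score_demo.columns)
assert not missing, "scoring data is missing columns: %s" % sorted(missing)

# Keep only the training features, in the training order
score_aligned = score_demo[list(X_demo.columns)]

predicted = pd.DataFrame(rf.predict(score_aligned),
                         columns=["Predicted"], index=score_aligned.index)
probs = pd.DataFrame(rf.predict_proba(score_aligned),
                     columns=["Prob of %s" % c for c in rf.classes_],
                     index=score_aligned.index)

# Join on the shared index so every row keeps its own prediction
result = score_aligned.join(predicted).join(probs)
print(result)
```

Passing index=score_aligned.index when building the prediction frames keeps the join correct even if the scoring data does not use a default 0..n-1 index.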
In [375]:
# finally, join the predictions and probabilities back onto the scoring data
data1 = score.join(readmit_patients) 
data1.head(10)
Out[375]:
race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code MED_SPEC_NUM num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient DIAG_CAT_1 DIAG_CAT_2 DIAG_CAT_3 number_diagnoses max_glu_serum A1Cresult metformin glimepiride glipizide glyburide pioglitazone rosiglitazone insulin change diabetesMed Predicted_Readmit_30 Prob of NO Prob of YES
0 3 1 4 0 3 11 1 2 7 19 48 4 11 0 0 0 16 14 12 9 0 0 0 0 0 0 0 0 0 0 0 0 1.00 0.00
1 3 0 4 0 1 1 7 2 1 0 31 1 28 0 0 0 27 3 17 5 0 0 2 0 0 2 0 2 3 1 1 0 0.95 0.05
2 3 1 6 0 6 7 7 1 0 18 42 0 12 0 0 0 23 32 3 3 2 0 0 0 0 2 0 0 0 0 1 0 1.00 0.00
3 3 0 7 0 1 1 7 5 0 0 52 2 25 1 1 0 24 12 13 9 0 0 0 0 0 0 0 0 2 1 1 0 0.95 0.05
4 3 0 4 0 2 1 7 2 14 55 41 2 3 0 0 0 15 17 2 9 0 0 0 0 0 0 0 0 0 0 0 0 0.70 0.30
5 3 1 4 0 5 1 1 5 6 0 1 2 25 1 0 0 32 9 18 5 0 0 2 0 0 2 0 2 2 1 1 0 0.85 0.15
6 1 0 8 0 1 3 7 1 7 0 58 1 11 1 0 1 23 13 23 7 0 0 0 0 0 0 0 0 2 0 1 0 0.95 0.05
7 3 1 5 0 1 6 7 5 7 0 54 0 13 0 0 0 6 17 26 8 0 0 0 0 0 0 0 0 2 0 1 0 0.95 0.05
8 3 0 8 0 2 3 1 3 0 0 48 0 10 0 0 0 17 3 18 7 0 1 0 0 0 2 0 0 2 1 1 0 0.85 0.15
9 3 0 8 0 1 3 7 4 1 0 41 0 14 0 0 1 24 18 3 9 0 0 0 0 1 0 0 0 0 1 1 0 0.95 0.05